anystyle 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (82)
  1. checksums.yaml +7 -0
  2. data/HISTORY.md +78 -0
  3. data/LICENSE +27 -0
  4. data/README.md +103 -0
  5. data/lib/anystyle.rb +71 -0
  6. data/lib/anystyle/dictionary.rb +132 -0
  7. data/lib/anystyle/dictionary/gdbm.rb +52 -0
  8. data/lib/anystyle/dictionary/lmdb.rb +67 -0
  9. data/lib/anystyle/dictionary/marshal.rb +27 -0
  10. data/lib/anystyle/dictionary/redis.rb +55 -0
  11. data/lib/anystyle/document.rb +264 -0
  12. data/lib/anystyle/errors.rb +14 -0
  13. data/lib/anystyle/feature.rb +27 -0
  14. data/lib/anystyle/feature/affix.rb +43 -0
  15. data/lib/anystyle/feature/brackets.rb +32 -0
  16. data/lib/anystyle/feature/canonical.rb +13 -0
  17. data/lib/anystyle/feature/caps.rb +20 -0
  18. data/lib/anystyle/feature/category.rb +70 -0
  19. data/lib/anystyle/feature/dictionary.rb +16 -0
  20. data/lib/anystyle/feature/indent.rb +16 -0
  21. data/lib/anystyle/feature/keyword.rb +52 -0
  22. data/lib/anystyle/feature/line.rb +39 -0
  23. data/lib/anystyle/feature/locator.rb +18 -0
  24. data/lib/anystyle/feature/number.rb +39 -0
  25. data/lib/anystyle/feature/position.rb +28 -0
  26. data/lib/anystyle/feature/punctuation.rb +22 -0
  27. data/lib/anystyle/feature/quotes.rb +20 -0
  28. data/lib/anystyle/feature/ref.rb +21 -0
  29. data/lib/anystyle/feature/terminal.rb +19 -0
  30. data/lib/anystyle/feature/words.rb +74 -0
  31. data/lib/anystyle/finder.rb +94 -0
  32. data/lib/anystyle/format/bibtex.rb +63 -0
  33. data/lib/anystyle/format/csl.rb +28 -0
  34. data/lib/anystyle/normalizer.rb +65 -0
  35. data/lib/anystyle/normalizer/brackets.rb +13 -0
  36. data/lib/anystyle/normalizer/container.rb +13 -0
  37. data/lib/anystyle/normalizer/date.rb +109 -0
  38. data/lib/anystyle/normalizer/edition.rb +16 -0
  39. data/lib/anystyle/normalizer/journal.rb +14 -0
  40. data/lib/anystyle/normalizer/locale.rb +30 -0
  41. data/lib/anystyle/normalizer/location.rb +24 -0
  42. data/lib/anystyle/normalizer/locator.rb +22 -0
  43. data/lib/anystyle/normalizer/names.rb +88 -0
  44. data/lib/anystyle/normalizer/page.rb +29 -0
  45. data/lib/anystyle/normalizer/publisher.rb +18 -0
  46. data/lib/anystyle/normalizer/pubmed.rb +18 -0
  47. data/lib/anystyle/normalizer/punctuation.rb +23 -0
  48. data/lib/anystyle/normalizer/quotes.rb +14 -0
  49. data/lib/anystyle/normalizer/type.rb +54 -0
  50. data/lib/anystyle/normalizer/volume.rb +26 -0
  51. data/lib/anystyle/parser.rb +199 -0
  52. data/lib/anystyle/support.rb +4 -0
  53. data/lib/anystyle/support/finder.mod +3234 -0
  54. data/lib/anystyle/support/finder.txt +75 -0
  55. data/lib/anystyle/support/parser.mod +15025 -0
  56. data/lib/anystyle/support/parser.txt +75 -0
  57. data/lib/anystyle/utils.rb +70 -0
  58. data/lib/anystyle/version.rb +3 -0
  59. data/res/finder/bb132pr2055.ttx +6803 -0
  60. data/res/finder/bb550sh8053.ttx +18660 -0
  61. data/res/finder/bb599nz4341.ttx +2957 -0
  62. data/res/finder/bb725rt6501.ttx +15276 -0
  63. data/res/finder/bc605xz1554.ttx +18815 -0
  64. data/res/finder/bd040gx5718.ttx +4271 -0
  65. data/res/finder/bd413nt2715.ttx +4956 -0
  66. data/res/finder/bd466fq0394.ttx +6100 -0
  67. data/res/finder/bf668vw2021.ttx +3578 -0
  68. data/res/finder/bg495cx0468.ttx +7267 -0
  69. data/res/finder/bg599vt3743.ttx +6752 -0
  70. data/res/finder/bg608dx2253.ttx +4094 -0
  71. data/res/finder/bh410qk3771.ttx +8785 -0
  72. data/res/finder/bh989ww6442.ttx +17204 -0
  73. data/res/finder/bj581pc8202.ttx +2719 -0
  74. data/res/parser/bad.xml +5199 -0
  75. data/res/parser/core.xml +7924 -0
  76. data/res/parser/gold.xml +2707 -0
  77. data/res/parser/good.xml +34281 -0
  78. data/res/parser/stanford-books.xml +2280 -0
  79. data/res/parser/stanford-diss.xml +726 -0
  80. data/res/parser/stanford-theses.xml +4684 -0
  81. data/res/parser/ugly.xml +33246 -0
  82. metadata +195 -0
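The hunk below (+3578 lines) is file 67 in the list above, data/res/finder/bf668vw2021.ttx, one of the finder's training documents. Each line carries a label column (title, text, meta, blank) separated by a pipe from the corresponding line of the source PDF; a line with an empty label column continues the label of the line above. As a rough, purely illustrative sketch (not the gem's own AnyStyle::Document reader), a file in this layout could be loaded like this:

```ruby
# Minimal sketch: read a "label | text" training file into [label, text]
# pairs. An empty label column inherits the previous label, as in the diff
# below. Illustrative only; not the gem's actual implementation.
def read_tagged(path)
  label = nil
  File.foreach(path).map do |line|
    tag, text = line.chomp.split('|', 2)
    label = tag.strip unless tag.nil? || tag.strip.empty?
    [label, text.to_s.strip]
  end
end

# e.g. count how many lines carry each label in this document:
# read_tagged('data/res/finder/bf668vw2021.ttx')
#   .group_by(&:first).each { |k, v| puts "#{k}: #{v.size}" }
```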
@@ -0,0 +1,3578 @@
1
+ title | A LIGHT-WEIGHT 3-D INDOOR ACQUISITION SYSTEM
2
+ | USING AN RGB-D CAMERA
3
+ blank |
4
+ |
5
+ |
6
+ |
7
+ title | A DISSERTATION
8
+ | SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
9
+ | ENGINEERING
10
+ | AND THE COMMITTEE ON GRADUATE STUDIES
11
+ | OF STANFORD UNIVERSITY
12
+ | IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
13
+ | FOR THE DEGREE OF
14
+ | DOCTOR OF PHILOSOPHY
15
+ blank |
16
+ |
17
+ |
18
+ |
19
+ text | Young Min Kim
20
+ | August 2013
21
+ | © 2013 by Young Min Kim. All Rights Reserved.
22
+ | Re-distributed by Stanford University under license with the author.
23
+ blank |
24
+ |
25
+ |
26
+ text | This work is licensed under a Creative Commons Attribution-
27
+ | Noncommercial 3.0 United States License.
28
+ | http://creativecommons.org/licenses/by-nc/3.0/us/
29
+ blank |
30
+ |
31
+ |
32
+ |
33
+ text | This dissertation is online at: http://purl.stanford.edu/bf668vw2021
34
+ blank |
35
+ text | Includes supplemental files:
36
+ | 1. Video for Chapter 4 (video_final_medium3.wmv)
37
+ | 2. Video for Chapter 2 (Reconstruct.mpg)
38
+ blank |
39
+ |
40
+ |
41
+ |
42
+ meta | ii
43
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
44
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
45
+ blank |
46
+ text | Leonidas Guibas, Primary Adviser
47
+ blank |
48
+ |
49
+ |
50
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
51
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
52
+ blank |
53
+ text | Bernd Girod
54
+ blank |
55
+ |
56
+ |
57
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
58
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
59
+ blank |
60
+ text | Sebastian Thrun
61
+ blank |
62
+ |
63
+ |
64
+ |
65
+ text | Approved for the Stanford University Committee on Graduate Studies.
66
+ | Patricia J. Gumport, Vice Provost for Graduate Education
67
+ blank |
68
+ |
69
+ |
70
+ |
71
+ text | This signature page was generated electronically upon submission of this dissertation in
72
+ | electronic format. An original signed hard copy of the signature page is on file in
73
+ | University Archives.
74
+ blank |
75
+ |
76
+ |
77
+ |
78
+ meta | iii
79
+ title | Abstract
80
+ blank |
81
+ text | Large-scale acquisition of exterior urban environments is by now a well-established
82
+ | technology, supporting many applications in map searching, navigation, and com-
83
+ | merce. The same is, however, not the case for indoor environments, where access is
84
+ | often restricted and the spaces can be cluttered. Recent advances in real-time 3D
85
+ | acquisition devices (e.g., Microsoft Kinect) enable everyday users to scan complex
86
+ | indoor environments at a video rate. Raw scans, however, are often noisy, incom-
87
+ | plete, and significantly corrupted, making semantic scene understanding difficult, if
88
+ | not impossible. In this dissertation, we present ways of utilizing prior information
89
+ | to semantically understand the environments from the noisy scans of real-time 3-D
90
+ | sensors. The presented pipelines are lightweight and have the potential to allow
91
+ | users to provide feedback at interactive rates.
92
+ | We first present a hand-held system for real-time, interactive acquisition of res-
93
+ | idential floor plans. The system integrates a commodity range camera, a micro-
94
+ | projector, and a button interface for user input and allows the user to freely move
95
+ | through a building to capture its important architectural elements. The system uses
96
+ | the Manhattan world assumption, which posits that wall layouts are rectilinear. This
97
+ | assumption allows generation of floor plans in real time, enabling the operator to
98
+ | interactively guide the reconstruction process and to resolve structural ambiguities
99
+ | and errors during the acquisition. The interactive component aids users with no ar-
100
+ | chitectural training in acquiring wall layouts for their residences. We show a number
101
+ | of residential floor plans reconstructed with the system.
102
+ | We then discuss how we exploit the fact that public environments typically contain
103
+ | a high density of repeated objects (e.g., tables, chairs, monitors, etc.) in regular or
104
+ blank |
105
+ |
106
+ meta | iv
107
+ text | non-regular arrangements with significant pose variations and articulations. We use
108
+ | the special structure of indoor environments to accelerate their 3D acquisition and
109
+ | recognition. Our approach consists of two phases: (i) a learning phase wherein we
110
+ | acquire 3D models of frequently occurring objects and capture their variability modes
111
+ | from only a few scans, and (ii) a recognition phase wherein from a single scan of a
112
+ | new area, we identify previously seen objects but in different poses and locations at
113
+ | an average recognition time of 200ms/model. We evaluate the robustness and limits
114
+ | of the proposed recognition system using a range of synthetic and real-world scans
115
+ | under challenging settings.
116
+ | Last, we present a guided real-time scanning setup, wherein the incoming 3D
117
+ | data stream is continuously analyzed, and the data quality is automatically assessed.
118
+ | While the user is scanning an object, the proposed system discovers and highlights
119
+ | the missing parts, thus guiding the operator (or the autonomous robot) to “where
120
+ | to scan next”. We assess the data quality and completeness of the 3D scan data
121
+ | by comparing to a large collection of commonly occurring indoor man-made objects
122
+ | using an efficient, robust, and effective scan descriptor. We have tested the system
123
+ | on a large number of simulated and real setups, and found the guided interface to be
124
+ | effective even in cluttered and complex indoor environments. Overall, the research
125
+ | presented in the dissertation discusses how low-quality 3-D scans can be effectively
126
+ | used to understand indoor environments and allow necessary user-interaction in real-
127
+ | time. The presented pipelines are designed to be quick and effective by utilizing
128
+ | different geometric priors depending on the target applications.
129
+ blank |
130
+ |
131
+ |
132
+ |
133
+ meta | v
134
+ title | Acknowledgements
135
+ blank |
136
+ text | All the work presented in this thesis would not have been possible without help from
137
+ | many people.
138
+ | First of all, I would like to express my sincerest gratitude to my advisor, Leonidas
139
+ | Guibas. He is not only an intelligent and inspiring scholar in amazingly diverse
140
+ | topics, but also a very caring advisor with deep insights into various aspects of life.
141
+ | He guided me through one of the toughest times of my life, and I am lucky to be one
142
+ | of his students.
143
+ | During my life at Stanford, I had the privilege of working with the smartest people
144
+ | in the world learning not only about research, but also about the different mind-sets
145
+ | that lead to successful careers. I would like to thank Bernd Girod, Christian Theobalt,
146
+ | Sebastian Thrun, Vladlen Koltun, Niloy Mitra, Saumitra Das, Stephen Gould, and
147
+ | Adrian Butscher for being mentors during different stages of my graduate career. I
148
+ | also appreciate help of wonderful collaborators on exciting projects: Jana Kosecka,
149
+ | Branislav Miscusik, James Diebel, Mike Sokolsky, Jen Dolson, Dongming Yan, and
150
+ | Qixing Huang.
151
+ | The work presented here was generously supported by the following funding
152
+ | sources: Samsung Scholarship, MPC-VCC, Qualcomm corporation.
153
+ | I adore my officemates for being cheerful and encouraging, and most of all, being
154
+ | there: Derek Chan, Rahul Biswas, Stephanie Lefevre, Qixing Huang, Jonathan Jiang,
155
+ | Art Tevs, Michael Kerber, Justin Solomon, Jonathan Huang, Fan Wang, Daniel Chen,
156
+ | Kyle Heath, Vangelis Kalogerakis, and Sharath Kumar Raghvendra. I often spent
157
+ | more time with them than with any other people.
158
+ | I have to thank all the friends I met at Stanford. In particular, I would like to
159
+ blank |
160
+ |
161
+ meta | vi
162
+ text | thank Stephanie Kwan, Karen Zhu, Landry Huet, and Yiting Yeh for fun hangouts
163
+ | and random conversations in my early years. I was also fortunate enough to meet a
164
+ | wonderful chamber music group led by Dr. Herbert Myers in which I could play early
165
+ | music with Michael Peterson and Lisa Silverman. I also appreciated being able to
166
+ | participate in a wonderful WISE (Women in Science and Engineering) group. WISE
167
+ | girls have always been smart, tender and supportive. Many Korean friends at Stanford
168
+ | were like family for me here. I will not attempt to name them all, but I would like to
169
+ | especially thank Jeongha Park, Soogine Chong, Sun-Hae Hong, Jenny Lee, Ga-Young
170
+ | Suh, Joyce Lee, Hyeji Kim, Sun Goo Lee, Wookyung Kim, Han Ho Song and Su-In
171
+ | Lee. While I was enjoying my life at Stanford, I was always connected to my friends
172
+ | in Korea. I would like to express my thanks for their trust and everlasting friendship.
173
+ | Last, I cannot thank my family enough. I would like to dedicate my thesis to my
174
+ | parents, Kwang Woo Kim and Mi Ja Lee. Their constant love and trust have helped
175
+ | me overcome hardships ever since I was born. I also enjoyed having my brother, Joo
176
+ | Hwan Kim, in the Bay Area. His passion and thoughtful advice always helped me
177
+ | and cheered me up. I thank my husband, Sung-Boem Park, for being by my side no
178
+ | matter what happened. He is my best friend, and he made me face and overcome
179
+ | challenges. I also need to thank my soon-to-be born son (due in August), for allowing
180
+ | me to accelerate the last stages of my Ph. D.
181
+ | Thank you all for making me who I am today.
182
+ blank |
183
+ |
184
+ |
185
+ |
186
+ meta | vii
187
+ title | Contents
188
+ blank |
189
+ text | Abstract iv
190
+ blank |
191
+ text | Acknowledgements vi
192
+ blank |
193
+ text | 1 Introduction 1
194
+ | 1.1 Background on RGB-D Cameras . . . . . . . . . . . . . . . . . . . . 3
195
+ | 1.1.1 Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
196
+ | 1.1.2 Noise Characteristics . . . . . . . . . . . . . . . . . . . . . . . 5
197
+ | 1.2 3-D Indoor Acquisition System . . . . . . . . . . . . . . . . . . . . . 6
198
+ | 1.3 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . 7
199
+ | 1.3.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
200
+ blank |
201
+ text | 2 Interactive Acquisition of Residential Floor Plans1 11
202
+ | 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
203
+ | 2.2 System Overview and Usage . . . . . . . . . . . . . . . . . . . . . . . 14
204
+ | 2.3 Data Acquisition Process . . . . . . . . . . . . . . . . . . . . . . . . . 16
205
+ | 2.3.1 Pair-Wise Registration . . . . . . . . . . . . . . . . . . . . . . 19
206
+ | 2.3.2 Plane Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 22
207
+ | 2.3.3 Global Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 23
208
+ | 2.3.4 Map Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
209
+ | 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
210
+ | 2.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 29
211
+ blank |
212
+ |
213
+ |
214
+ |
215
+ meta | viii
216
+ text | 3 Environments with Variability and Repetition 33
217
+ | 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
218
+ | 3.1.1 Scanning Technology . . . . . . . . . . . . . . . . . . . . . . . 35
219
+ | 3.1.2 Geometric Priors for Objects . . . . . . . . . . . . . . . . . . . 35
220
+ | 3.1.3 Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . 36
221
+ | 3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
222
+ | 3.2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
223
+ | 3.2.2 Hierarchical Structure . . . . . . . . . . . . . . . . . . . . . . 40
224
+ | 3.3 Learning Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
225
+ | 3.3.1 Initializing the Skeleton of the Model . . . . . . . . . . . . . . 43
226
+ | 3.3.2 Incrementally Completing a Coherent Model . . . . . . . . . . 45
227
+ | 3.4 Recognition Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
228
+ | 3.4.1 Initial Assignment for Parts . . . . . . . . . . . . . . . . . . . 47
229
+ | 3.4.2 Refined Assignment with Geometry . . . . . . . . . . . . . . . 49
230
+ | 3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
231
+ | 3.5.1 Synthetic Scenes . . . . . . . . . . . . . . . . . . . . . . . . . 51
232
+ | 3.5.2 Real-World Scenes . . . . . . . . . . . . . . . . . . . . . . . . 54
233
+ | 3.5.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
234
+ | 3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
235
+ | 3.5.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
236
+ | 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
237
+ blank |
238
+ text | 4 Guided Real-Time Scanning 64
239
+ | 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
240
+ | 4.1.1 Interactive Acquisition . . . . . . . . . . . . . . . . . . . . . . 67
241
+ | 4.1.2 Scan Completion . . . . . . . . . . . . . . . . . . . . . . . . . 67
242
+ | 4.1.3 Part-Based Modeling . . . . . . . . . . . . . . . . . . . . . . . 67
243
+ | 4.1.4 Template-Based Completion . . . . . . . . . . . . . . . . . . . 68
244
+ | 4.1.5 Shape Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 68
245
+ | 4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
246
+ | 4.2.1 Scan Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 70
247
+ blank |
248
+ |
249
+ meta | ix
250
+ text | 4.2.2 Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 70
251
+ | 4.2.3 Scan Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
252
+ | 4.3 Partial Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 71
253
+ | 4.3.1 View-Dependent Simulated Scans . . . . . . . . . . . . . . . . 72
254
+ | 4.3.2 A2h Scan Descriptor . . . . . . . . . . . . . . . . . . . . . . . 73
255
+ | 4.3.3 Descriptor-Based Shape Matching . . . . . . . . . . . . . . . . 74
256
+ | 4.3.4 Scan Registration . . . . . . . . . . . . . . . . . . . . . . . . . 75
257
+ | 4.4 Interface Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
258
+ | 4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
259
+ | 4.5.1 Model Database . . . . . . . . . . . . . . . . . . . . . . . . . . 76
260
+ | 4.5.2 Retrieval Results with Simulated Data . . . . . . . . . . . . . 77
261
+ | 4.5.3 Retrieval Results with Real Data . . . . . . . . . . . . . . . . 78
262
+ | 4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
263
+ blank |
264
+ text | 5 Conclusions 89
265
+ blank |
266
+ text | Bibliography 91
267
+ blank |
268
+ |
269
+ |
270
+ |
271
+ meta | x
272
+ title | List of Tables
273
+ blank |
274
+ text | 2.1 Accuracy comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 29
275
+ blank |
276
+ text | 3.1 Parameters used in our algorithm . . . . . . . . . . . . . . . . . . . . 41
277
+ | 3.2 Models obtained from the learning phase . . . . . . . . . . . . . . . . 55
278
+ | 3.3 Statistics for the recognition phase . . . . . . . . . . . . . . . . . . . 56
279
+ | 3.4 Statistics between objects learned for each scene category . . . . . . . 59
280
+ blank |
281
+ text | 4.1 Database and scan statistics . . . . . . . . . . . . . . . . . . . . . . . 76
282
+ blank |
283
+ |
284
+ |
285
+ |
286
+ meta | xi
287
+ title | List of Figures
288
+ blank |
289
+ text | 1.1 Triangulation principle . . . . . . . . . . . . . . . . . . . . . . . . . . 4
290
+ | 1.2 Kinect sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
291
+ blank |
292
+ text | 2.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
293
+ | 2.2 System pipeline and usage . . . . . . . . . . . . . . . . . . . . . . . . 15
294
+ | 2.3 Notation and representation . . . . . . . . . . . . . . . . . . . . . . . 17
295
+ | 2.4 Illustration for pair-wise registration . . . . . . . . . . . . . . . . . . 19
296
+ | 2.5 Optical flow and image plane correspondence . . . . . . . . . . . . . . 20
297
+ | 2.6 Silhouette points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
298
+ | 2.7 Optimizing the map . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
299
+ | 2.8 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
300
+ | 2.9 Analysis on computational time . . . . . . . . . . . . . . . . . . . . . 27
301
+ | 2.10 Visual comparisons of the generated floor plans . . . . . . . . . . . . 31
302
+ | 2.11 A possible example of extensions . . . . . . . . . . . . . . . . . . . 32
303
+ blank |
304
+ text | 3.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
305
+ | 3.2 Acquisition pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
306
+ | 3.3 Hierarchical data structure. . . . . . . . . . . . . . . . . . . . . . . . 39
307
+ | 3.4 Overview of the learning phase . . . . . . . . . . . . . . . . . . . . . 42
308
+ | 3.5 Attachment of the model . . . . . . . . . . . . . . . . . . . . . . . . . 46
309
+ | 3.6 Overview of the recognition phase . . . . . . . . . . . . . . . . . . . . 47
310
+ | 3.7 Refining the segmentation . . . . . . . . . . . . . . . . . . . . . . . . 50
311
+ | 3.8 Recognition results on synthetic scans of virtual scenes . . . . . . . . 52
312
+ | 3.9 Chair models used in synthetic scenes . . . . . . . . . . . . . . . . . . 53
313
+ blank |
314
+ |
315
+ meta | xii
316
+ text | 3.10 Precision-recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
317
+ | 3.11 Various models learned/used in our test . . . . . . . . . . . . . . . . 55
318
+ | 3.12 Recognition results for various office and auditorium scenes . . . . . . 61
319
+ | 3.13 A close-up office scene . . . . . . . . . . . . . . . . . . . . . . . . . . 62
320
+ | 3.14 Comparison with an indoor labeling system . . . . . . . . . . . . . . 63
321
+ blank |
322
+ text | 4.1 A real-time guided scanning system . . . . . . . . . . . . . . . . . . . 65
323
+ | 4.2 Pipeline of the real-time guided scanning framework . . . . . . . . . . 69
324
+ | 4.3 Representative shape retrieval results . . . . . . . . . . . . . . . . . . 80
325
+ | 4.4 The proposed guided real-time scanning setup . . . . . . . . . . . . . 81
326
+ | 4.5 Retrieval results with simulated data using a chair data set . . . . . . 82
327
+ | 4.6 Retrieval results with simulated data using a couch data set . . . . . 83
328
+ | 4.7 Retrieval results with simulated data using a lamp data set . . . . . . 84
329
+ | 4.8 Retrieval results with simulated data using a table data set . . . . . . 85
330
+ | 4.9 Comparison between retrieval with view-dependent and merged scans 86
331
+ | 4.10 Effect of density-aware sampling . . . . . . . . . . . . . . . . . . . . . 87
332
+ | 4.11 Effect of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
333
+ | 4.12 Real-time retrieval results on various datasets . . . . . . . . . . . . . 88
334
+ blank |
335
+ |
336
+ |
337
+ |
338
+ meta | xiii
339
+ | Chapter 1
340
+ blank |
341
+ title | Introduction
342
+ blank |
343
+ text | Acquiring a 3-D model of a real-world object, also known as 3-D reconstruction
344
+ | technology, has long been a challenge for various applications, including robotics
345
+ | navigation, 3-D modeling of virtual worlds, augmented reality, computer graphics,
346
+ | and manufacturing. In the graphics community, a 3-D model is typically acquired in a
347
+ | carefully calibrated set-up with highly accurate laser scans, followed by a complicated
348
+ | off-line process from scan registration to surface reconstruction. Because this is a very
349
+ | long process that requires special equipment, only a limited number of objects can be
350
+ | modeled, and the method cannot be scaled to larger environments.
351
+ | One of the most common applications of a large-scale 3-D reconstruction comes
352
+ | from modeling of urban environments. To build a model, a vehicle equipped with
353
+ | different sensors drives along roads and collects a large amount of data from lasers,
354
+ | GPS signals, wheel counters, cameras, etc. The data is then processed and stored in a
355
+ | compact form which includes important roads, buildings, parking lots. The mapped
356
+ | environments are used frequently in cell-phone applications, mapping technology or
357
+ | navigation tools.
358
+ | However, we cannot simply extend the same technology used in the 3-D reconstruc-
359
+ | tion of urban environments to indoor environments. First, unlike urban environments,
360
+ | where permanent roads exist, there are no clearly defined pathways that people must
361
+ | follow in an indoor environment. Occupants walk in various patterns around an in-
362
+ | door area, and often the space is cluttered, which could result in safety issues if, say,
363
+ blank |
364
+ |
365
+ meta | 1
366
+ | CHAPTER 1. INTRODUCTION 2
367
+ blank |
368
+ |
369
+ |
370
+ text | a robot with sensors drives within the area. Second, an indoor environment is not
371
+ | static. As residents and workers of the building engage in daily activities in interior
372
+ | environments, many objects are moved around or disappear, and new objects can be
373
+ | introduced. Third, interior shapes are much more complex compared to the outdoor
374
+ | surfaces of buildings, and it cannot simply be assumed that the objects present in a
375
+ | space are composed of flat surfaces as is generally the case in outdoor urban settings.
376
+ | Last, the modality of sensors used for outdoor mapping is not suitable for interior
377
+ | mapping and needs to be changed. A GPS signal does not work in indoor environ-
378
+ | ments, and the lighting conditions can vary significantly from one space to another
379
+ | compared to relatively constant sunlight outdoors.
380
+ | Yet, 3-D reconstruction of indoor environments also has a variety of potential
381
+ | applications. After a 3-D model of an indoor environment is acquired, the model
382
+ | could be used for interior design, indoor navigation, surveillance, or understanding
383
+ | the interior layouts and existence of objects in a space. Depending on the applications
384
+ | for which the reconstructed model would be used, the distance range and level of detail
385
+ | needed can vary as well.
386
+ | Recently, real-time 3-D sensors, such as the RGB-D sensors, a light-weight com-
387
+ | modity device, have been specifically designed to function in indoor environments and
388
+ | used to provide real-time 3-D data. Although the data captured from these sensors
389
+ | suffer from a limited field of view and complex noise characteristics, and therefore
390
+ | might not be suitable for accurate 3-D reconstruction, they can be used by everyday
391
+ | users to easily capture and utilize 3-D information of indoor environments. The work
392
+ | presented in this dissertation uses the data captured from RGB-D cameras with the
393
+ | goal of providing a useful 3-D acquisition while overcoming the limitations of the
394
+ | captured data. To do this, we have assumed different geometric priors depending on
395
+ | the targeted applications.
396
+ | In the remainder of this chapter, we first describe the characteristics of RGB-
397
+ | D camera sensors (Section 1.1). The subsequent section (Section 1.2) presents our
398
+ | approach to acquire 3-D indoor environments. The chapter concludes with an outline
399
+ | of the remainder of the dissertation (Section 1.3).
400
+ meta | CHAPTER 1. INTRODUCTION 3
401
+ blank |
402
+ |
403
+ |
404
+ title | 1.1 Background on RGB-D Cameras
405
+ text | Building a 3-D model of actual objects enables the real world to be connected to a
406
+ | virtual world. After obtaining a digital model from a real-world object, the model can
407
+ | be used in various applications. A benefit of 3D modeling is that the digital object
408
+ | can be saved and altered freely without an actual space being damaged or destroyed.
409
+ | Until recently, it was not possible for non-expert users to capture real-world envi-
410
+ | ronments in 3D because of the complexity and cost of the required equipment. RGB-D
411
+ | cameras, which provide real-time depth and color information, only became available
412
+ | a few years ago. The pioneering commodity product is the Xbox Kinect [Mic10],
413
+ | launched in November 2010. Originally developed as a gaming device, the sensor pro-
414
+ | vides real-time depth streams enabling interaction between a user and a system.
415
+ | The Kinect is affordable and easy to operate for non-expert users, and the pro-
416
+ | duced data can be accessed through open-source drivers. Although the main purpose
417
+ | of the Kinect thus far has been motion-sensing, providing a real-time interface for gam-
418
+ | ing or control, the device has served many purposes and has been used as a tool to
419
+ | develop personalized applications with the help of the drivers. Some developers also
420
+ | use the device to extend computer vision-related tasks (such as object recognition
421
+ | or structure from motion) but with depth measurements augmented as an additional
422
+ | modality of input. In addition, the device can also be viewed as a 3-D sensor that
423
+ | produces 3-D pointcloud data. In our work, this is how we view the device, and the
424
+ | goal of the research presented here, as noted above, was to acquire 3-D indoor objects
425
+ | or environments using the RGB-D cameras of the Kinect sensor.
426
+ blank |
427
+ |
428
+ title | 1.1.1 Technology
429
+ text | The underlying core technology of the depth-capturing capacity of Kinect comes
430
+ | from its structured-light 3D scanner. This scanner measures the three-dimensional
431
+ | shape of an object using projected light patterns and a camera system. A typical
432
+ | scanner measuring assembly consists of one stripe projector and at least one camera.
433
+ | Projecting a narrow band of light onto a three-dimensionally shaped surface produces
434
+ | a line of illumination that appears distorted from perspectives other than that of the
435
+ meta | CHAPTER 1. INTRODUCTION 4
436
+ blank |
437
+ |
438
+ |
439
+ |
440
+ text | Figure 1.1: Triangulation principle shown by one of multiple stripes (image from
441
+ | http://en.wikipedia.org/wiki/File:1-stripesx7.svg)
442
+ blank |
443
+ text | projector, and this line can be used for an exact geometric reconstruction of the
444
+ | surface shape. A sample setup with the projected line pattern is shown in Figure 1.1.
445
+ | The displacement of the stripes can be converted into 3D coordinates, which allow
446
+ | any details on an object’s surface to be retrieved.
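As a worked illustration of the stripe-displacement idea just described, the geometry reduces to simple triangulation: a pattern feature projected from a source offset by a baseline appears displaced in the camera image, and the displacement determines depth. The sketch below is only an illustration; the focal length and baseline values are hypothetical placeholders, not calibration data from the text.

```ruby
# Triangulation sketch: a projected feature displaced by `disparity_px`
# pixels, seen with focal length `focal_px` (pixels) and projector-camera
# baseline `baseline_m` (meters), gives depth z = f * b / d.
# The default values are hypothetical placeholders.
def depth_from_disparity(disparity_px, focal_px: 580.0, baseline_m: 0.075)
  return nil if disparity_px <= 0  # no measurable displacement, no depth
  focal_px * baseline_m / disparity_px
end

# With these placeholder values, a 20-pixel displacement corresponds to
# roughly 2.2 m:
# puts depth_from_disparity(20.0)
```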
447
+ | An invisible structured-light scanner scans a 3-D shape of an object by projecting
448
+ | patterns with light in an invisible spectrum. The Kinect uses projecting patterns
449
+ | composed of points in infrared (IR) light to generate video data in 3D. As shown in
450
+ | Figure 1.2, the Kinect is a horizontal bar with an IR light emitter and IR sensor. The
451
+ | IR emitter emits infrared light beams, and the IR sensor reads the IR beams reflected
452
+ | back to the sensor. The reflected beams are converted into depth information that
453
+ | measures the distance between an object and the sensor. This makes capturing a
454
+ | depth image possible. The color sensor captures normal video (visible light) that is
455
+ | synchronized with the depth data. The horizontal bar of the Kinect also contains
456
+ | microphone arrays and is connected to a small base by a tilt motor. While the color
457
+ | video and microphone provide additional means for a natural user interface, in this
458
+ meta | CHAPTER 1. INTRODUCTION 5
459
+ blank |
460
+ |
461
+ |
462
+ |
463
+ text | Figure 1.2: Kinect sensor (left) and illustration of the integrated hardware (right).
464
+ | (images from http://i.msdn.microsoft.com/dynimg/IC568992.png and http://
465
+ | i.msdn.microsoft.com/dynimg/IC584396.png)
466
+ blank |
467
+ text | dissertation, we are focused on the depth-sensing capability of the device.
468
+ | The Kinect has a limited working range, mainly designed for the volume that a
469
+ | person will require while playing a game. Kinect’s official documentation1 suggests
470
+ | a working range from 0.8 m to 4 m from the sensor. The sensor has an angular field
471
+ | of view of 57° horizontally and 43° vertically. When an object is out of range for
472
+ | a particular pixel, the system will return no values. The RGB video streams are
473
+ | produced in a 1280×960 resolution. However, the default RGB video stream uses 8-
474
+ | bit VGA resolution (640×480 pixels). The monochrome depth sensing video stream
475
+ | is also in VGA resolution with 11-bit depth, which provides 2,048 levels of sensitivity.
476
+ | The depth and color streams are produced at a frame rate of 30 Hz.
477
+ | The depth data is originally produced as a 2-D grid of raw depth values. The
478
+ | values in each pixel can then be converted into (x, y, z) coordinates with calibration
479
+ | data. Depending on the application, the developer can regard the 2-D grid of values
480
+ | as a depth image, or the scattered points in 3-D ((x, y, z) coordinates) as unstructured
481
+ | pointcloud data.
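As a concrete, purely illustrative sketch of that conversion, a raw depth pixel can be back-projected through a pinhole model. The intrinsics below are hypothetical stand-ins for the per-device calibration data mentioned above, and a non-positive reading is treated as the "no value returned" case described earlier.

```ruby
# Illustrative back-projection of a depth pixel (u, v) to an (x, y, z) point.
# FX, FY, CX, CY are hypothetical intrinsics standing in for real
# calibration data; a non-positive reading is treated as missing.
FX = FY = 580.0
CX, CY = 319.5, 239.5

def pixel_to_point(u, v, depth_m)
  return nil if depth_m.nil? || depth_m <= 0.0
  [(u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m]
end

# Converting every valid pixel of a 640x480 depth frame gives the
# unstructured pointcloud view of the same data:
# cloud = (0...480).flat_map { |v|
#   (0...640).map { |u| pixel_to_point(u, v, depth[v][u]) }
# }.compact
```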
482
+ blank |
483
+ |
484
+ title | 1.1.2 Noise Characteristics
485
+ text | While RGB-D cameras can provide real-time depth information, the obtained mea-
486
+ | surements exhibit convoluted noise characteristics. The measurements are extracted
487
+ meta | 1
488
+ text | http://msdn.microsoft.com/en-us/library/jj131033.aspx
489
+ meta | CHAPTER 1. INTRODUCTION 6
490
+ blank |
491
+ |
492
+ |
493
+ text | from identification of corresponding points of infrared projections in image pixels,
494
+ | and there are multiple possible sources of errors: (i) calibration error, since both the
495
+ | extrinsic calibration parameters, which are given as the displacement between the
496
+ | projector and cameras, and the intrinsic calibration parameters, which depend on
497
+ | the focal points and size of pixels on the sensor grid, vary for each product; (ii)
498
+ | distance-dependent quantization error – because the accuracy of measurements de-
499
+ | pends on the resolution of a pixel compared to the details of projected pattern on
500
+ | the measured object, measurements are more noisy for farther points with more se-
501
+ | vere quantization artifacts; (iii) error from ambiguous or poor projection, in which
502
+ | the cameras cannot clearly observe the projected patterns – as the measurements are
503
+ | made by identifying the projected location of the infrared pattern, the distortion of
504
+ | the projected patterns on depth boundaries or on reflective material can result in
505
+ | wrong measurements. Sometimes the system cannot locate the corresponding points
506
+ | due to occlusion by parallax or distance range, and the data is reported as missing.
507
+ | In short, the depth data exhibits highly non-linear noise characteristics, and it is very
508
+ | hard to model all of the noise analytically.
509
+ blank |
510
+ |
511
+ title | 1.2 3-D Indoor Acquisition System
512
+ text | Given the complex noise characteristics of RGB-D cameras, we assumed that the de-
513
+ | vice produces noisy pointcloud data. Instead of reverse-engineering and correcting the
514
+ | noise from each source, we overcame the limitation on data by imposing assumptions
515
+ | on the 3-D shape of the objects being scanned.
516
+ | There are three possible ways to reconstruct 3-D models from noisy data. The first
517
+ | is to overcome the limitation of the data by accumulating multiple frames from slightly dif-
518
+ | ferent viewpoints [IKH+ 11]. By averaging the noise measurements and merging them
519
+ | into a single volumetric structure, a very high-quality mesh model can be recovered.
520
+ | The second is using a machine learning-based method. In this approach, multiple
521
+ | instances of measurements and actual object labels are first collected. Classifiers are
522
+ | then trained to produce the object labels given the measurements and later used to
523
+ | understand the given measurements. The third way is to assume geometric priors on
524
+ meta | CHAPTER 1. INTRODUCTION 7
525
+ blank |
526
+ |
527
+ |
528
+ text | the data being captured. Assuming that the underlying scene is not completely ran-
529
+ | dom, the shape to be reconstructed has a limited degree of freedom, and can thus be
530
+ | reconstructed by inferring the most probable shape within the scope of the assumed
531
+ | structure.
532
+ | This third way is the method used in our work. By focusing on acquiring the pre-
533
+ | defined modes or degree of freedom given the geometric priors, the acquired model
534
+ | naturally captures high-level information about the structure. In addition, the acquisition
535
+ | pipeline becomes lightweight and the entire process can stay real-time. Because the in-
536
+ | put data stream is also real-time, there is the possibility of incorporating user-interaction
537
+ | during the capturing process.
538
+ blank |
539
+ |
540
+ title | 1.3 Outline of the Dissertation
541
+ text | The chapters to follow, outlined below, discuss in detail the specific approaches we
542
+ | took to mitigate the problems inherent in indoor reconstruction from noisy sensor
543
+ | data.
544
+ | Chapter 2 discusses a pipeline used to acquire floor plans in residential areas. The
545
+ | proposed system is quick and convenient compared to the common pipeline used to
546
+ | acquire floor plans from manual sketching and measurements, which are frequently
547
+ | required for remodeling or selling a property. We posit that the world is composed of
548
+ | relatively large, flat surfaces that meet at right angles. We focus on continuous collec-
549
+ | tion of points that occupy large, flat areas and align with the axes, ignoring other
550
+ | points. Even with very noisy data, the process can be performed at an interactive
551
+ | rate since the space of possible plane arrangements is sparse given the measurements.
552
+ | We take advantage of real-time data and allow users to provide intuitive feedback
553
+ | to assist the acquisition pipeline. The research described in the chapter was first
554
+ | published as Y.M. Kim, J. Dolson, M. Sokolsky, V. Koltun, S.Thrun, Interactive
555
+ | Acquisition of Residential Floor Plans, IEEE International Conference on Robotics
556
+ | and Automation (ICRA), 2012 © 2012 IEEE, and the contents were also replicated
557
+ | with small modifications.
558
+ meta | CHAPTER 1. INTRODUCTION 8
559
+ blank |
560
+ |
561
+ |
562
+ text | Chapter 3 discusses how we targeted public spaces with many repeating ob-
563
+ | jects in different poses or variation modes. Even though indoor environments can
564
+ | frequently change, we can identify patterns and possible movements by reasoning
565
+ | at the object level. Especially in public buildings (offices, cafeterias, auditoriums, and
566
+ | seminar rooms), chairs, tables, monitors, etc, are repeatedly used in similar pat-
567
+ | terns. We first build abstract models of the objects of interest with simple geometric
568
+ | primitives and deformation modes. We then use the built models to quickly de-
569
+ | tect the objects of interest within an indoor scene in which the objects repeatedly ap-
570
+ | pear. While the models are simple approximations of the actual complex geometry, we
571
+ | demonstrate that the models are sufficient to detect the object within noisy, par-
572
+ | tial indoor scene data. The learned variability modes not only factor out nuisance
573
+ | modes of variability (e.g., motions of chairs, etc.) from meaningful changes (e.g.,
574
+ | security, where the new scene objects should be flagged), but also provide the func-
575
+ | tional modes of the object (the status of open drawers, closed laptop, etc.), which
576
+ | potentially provide high-level understanding of the scene. The study discussed here
577
+ | first appeared as a publication, Young Min Kim, Niloy J. Mitra, Dong-Ming Yan,
578
+ | and Leonidas Guibas. 2012. Acquiring 3D indoor environments with variability and
579
+ | repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
580
+ | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157, from
581
+ | which the major written parts of the chapter were adapted.
582
+ | Chapter 4 discusses a reconstruction approach that utilizes 3-D models down-
583
+ | loaded from the web to assist in understanding the objects being scanned. The data
584
+ | stream from an RGB-D camera is noisy and exhibits a lot of missing data, making it
585
+ | very hard to accurately build a full model of an object being scanned. We take the
586
+ | approach of using a large database of 3-D models to match against partial, noisy scans
587
+ | of the input data stream. To this end, we propose a simple, efficient, yet discrimina-
588
+ | tive descriptor that can be evaluated in real-time and used to process complex indoor
589
+ | scenes. The matching models are quickly found from the database with the help of our
590
+ | proposed shape descriptor. This also allows real-time assessment of the quality of the
591
+ | data captured, and the system provides the user with real-time feedback on where to
592
+ | scan. Eventually the user can retrieve the closest model as quickly as possible during
593
+ meta | CHAPTER 1. INTRODUCTION 9
594
+ blank |
595
+ |
596
+ |
597
+ text | the scanning session. The research and contents of the chapter will be published as
598
+ | Y.M. Kim, N. Mitra, Q. Huang, L. Guibas, Guided Real-Time Scanning of Indoor
599
+ | Environments, Pacific Graphics 2013.
600
+ | Chapter 5 concludes the dissertation with a summary of our work and a discussion
601
+ | of future directions this research could take.
602
+ blank |
603
+ |
604
+ title | 1.3.1 Contributions
605
+ text | The major contribution of the dissertation is to present methods to quickly acquire
606
+ | 3-D information from noisy, occluded pointcloud data by assuming geometric pri-
607
+ | ors. The pre-defined modes not only provide high-level understanding of the current
608
+ | mode, but also allow the data size to stay compact, which, in turn, saves memory
609
+ | and processing time. The proposed geometric priors have been previously used for
610
+ | different settings, but our approach incorporates the priors tuned for the practical
611
+ | tasks at hand with real scans from RGB-D data acquired from actual environments.
612
+ | The example geometric priors that are covered are as follows:
613
+ blank |
614
+ text | • Based on the Manhattan-world assumption, important architectural elements (walls,
615
+ | floor and ceiling) can be retrieved in real-time.
616
+ blank |
617
+ text | • By building an abstract model composed of simple geometric primitives and joint
618
+ | information between primitives, objects under severe occlusion and different
619
+ | configurations can be located. The bottom-up approach can quickly populate
620
+ | large indoor environments with variability and repetition (around 200 ms per
621
+ | object).
622
+ blank |
623
+ text | • An online public database of 3-D models recovers the structure of objects from
624
+ | partial, noisy scans in a matter of seconds. We developed a relation-based
625
+ | lightweight descriptor for fast and accurate model retrieval and positioning.
626
+ blank |
627
+ text | We also take advantage of the representation and demonstrate a quick and effi-
628
+ | cient pipeline, including user-interaction when possible. More specifically, we demon-
629
+ | strate the following novel prototypes of systems:
630
+ meta | CHAPTER 1. INTRODUCTION 10
631
+ blank |
632
+ |
633
+ |
634
+ text | • A new hand-held system with which a user can capture a space and automatically
635
+ | generate a floor plan. The user does not have to measure distances or manually
636
+ | sketch the layout.
637
+ blank |
638
+ text | • A projector attached to the RGB-D camera to communicate the current status of
639
+ | the acquisition on the physical surface with the user, thus allowing the user to provide
640
+ | intuitive feedback.
641
+ blank |
642
+ text | • A real-time guided scanning setup for online quality assessment of streaming
643
+ | RGB-D data, obtained with the help of a 3-D database of models.
644
+ blank |
645
+ text | While the specific geometric priors and prototypes listed above come from an under-
646
+ | standing of the characteristics of the task at hand, the underlying assumptions and
647
+ | approach provide a direction that allows everyday users to acquire useful 3-D information
648
+ | in the years to come as real-time 3-D scans become available.
649
+ meta | Chapter 2
650
+ blank |
651
+ title | Interactive Acquisition of
652
+ | Residential Floor Plans1
653
+ blank |
654
+ text | Acquiring an accurate floor plan of a residence is a challenging task, yet one that
655
+ | is required for many situations, such as remodeling or sale of a property. Original
656
+ | blueprints can be difficult to find, especially for older residences. In practice, contrac-
657
+ | tors and interior designers use point-to-point laser measurement devices to acquire
658
+ | a set of distance measurements. Based on these measurements, an expert creates a
659
+ | floor plan that respects the measurements and represents the layout of the residence.
660
+ | Both taking measurements and representing the layout are cumbersome manual tasks
661
+ | that require experience and time.
662
+ | In this chapter, we present a hand-held system for indoor architectural reconstruc-
663
+ | tion. This system eliminates the manual post-processing necessary for reconstructing
664
+ | the layout of walls in a residence. Instead, an operator with no architectural exper-
665
+ | tise can interactively guide the reconstruction process by moving freely through an
666
+ meta | 1
667
+ text | The contents of the chapter were originally published as Y.M. Kim, J. Dolson, M. Sokolsky, V.
668
+ | Koltun, S.Thrun, Interactive Acquisition of Residential Floor Plans, IEEE International Conference
669
+ | on Robotics and Automation (ICRA), 2012 © 2012 IEEE.
670
+ | In reference to IEEE copyrighted material which is used with permission in this thesis, the
671
+ | IEEE does not endorse any of Stanford University’s products or services. Internal or personal
672
+ | use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material
673
+ | for advertising or promotional purposes or for creating new collective works for resale or redis-
674
+ | tribution, please go to http://www.ieee.org/publications_standards/publications/rights/
675
+ | rights_link.html to learn how to obtain a License from RightsLink.
676
+ blank |
677
+ |
678
+ meta | 11
679
+ | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 12
680
+ blank |
681
+ |
682
+ |
683
+ |
684
+ |
685
+ |
686
+ |
687
+ |
688
+ |
689
+ |
690
+ |
691
+ |
692
+ |
693
+ |
694
+ |
695
+ |
696
+ |
697
+ |
698
+ |
699
+ |
700
+ |
701
+ |
702
+ |
703
+ |
704
+ |
705
+ |
706
+ |
707
+ |
708
+ |
709
+ |
710
+ |
711
+ |
712
+ |
713
+ |
714
+ |
715
+ |
716
+ |
717
+ |
718
+ |
719
+ |
720
+ text | Figure 2.1: Our hand-held system is composed of a projector, a Microsoft Kinect
721
+ | sensor, and an input button (left). The system uses augmented reality feedback
722
+ | (middle left) to project the status of the current model onto the environment and to
723
+ | enable real-time acquisition of residential wall layouts (middle right). The floor plan
724
+ | (middle right) and visualization (right) were generated using data captured by our
725
+ | system.
726
+ blank |
727
+ text | interior with the hand-held system until all walls have been observed by the sensor
728
+ | in the system.
729
+ | Our system is composed of a laptop connected to an RGB-D camera, a lightweight
730
+ | optical projector, and an input button interface (Figure 2.1, left). The RGB-D cam-
731
+ | era is a real-time depth sensor that acts as the main input modality. As noted in
732
+ | Chapter 1, we use the Microsoft Kinect, a lightweight commodity device that out-
733
+ | puts VGA-resolution range and color images at video rates. The data is processed
734
+ | in real time to create the floor plan by focusing on large flat surfaces and ignoring
735
+ | clutter. The generated floor plan can be used directly for remodeling or real-estate
736
+ | applications or to produce a 3D model of the interior for applications in virtual envi-
737
+ | ronments. In Section 2.4, we present and discuss a number of residential wall layouts
738
+ | reconstructed with our system, captured from actual apartments. Even though the
739
+ | results presented here were obtained in residential spaces, the system can also
740
+ | be used in other types of interior environments.
741
+ | The attached projector is initially calibrated to have an overlapping field of view
742
+ | with the same image center as the depth sensor. It projects the reconstruction status
743
+ | onto the surface being scanned. Under normal lighting, the projector does not provide
744
+ | a sophisticated rendering. Rather, the projection allows the user to visualize the
745
+ | reconstruction process. The user can then detect reconstruction errors that arise due
746
+ | to deficiencies in the data capture path and can complete missing data in response.
747
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 13
748
+ blank |
749
+ |
750
+ |
751
+ text | The user can also note which walls have been included in the model and easily resolve
752
+ | ambiguities with a simple input device. The proposed system has advantages over
753
+ | other previous applications by allowing a new type of user interaction in real time that
754
+ | focuses only on architectural elements relevant to the task at hand. This difference
755
+ | is discussed in detail in the following section.
756
+ blank |
757
+ |
758
+ title | 2.1 Related Work
759
+ text | A number of approaches have been proposed for indoor reconstruction in computer
760
+ | graphics, computer vision, and robotics. Real-time indoor reconstruction using either
761
+ | a depth sensor [HKH+ 12] or an optical camera [ND10] has been recently explored.
762
+ | The results of these studies suggest that the key to real-time performance is the
763
+ | fast registration of successive frames. Similar to [HKH+ 12], we fuse both color and
764
+ | depth information to register frames. Furthermore, our approach extends real-time
765
+ | acquisition and reconstruction by allowing the operator to visualize the current re-
766
+ | construction status without consulting a computer screen. Because the feedback loop
767
+ | in our system is immediate, the operator can resolve failures and ambiguities while
768
+ | the acquisition session is in progress.
769
+ | Previous approaches have also been limited to a dense 3-D reconstruction (reg-
770
+ | istration of point cloud data) with no higher-level information, which is memory
771
+ | intensive. A few exceptions exist, such as [GCCMC08], in which high-level fea-
772
+ | tures (lines and planes) are detected to reduce complexity and noise. The high-level
773
+ | structures, however, do not necessarily correspond to actual architectural elements,
774
+ | such as walls, floors, or ceilings. In contrast, our system identifies and focuses on
775
+ | significant architectural elements using the Manhattan-world assumption, which is
776
+ | based on the observation that many indoor scenes are largely rectilinear [CY99]. This
777
+ | assumption is widely made for indoor scene reconstruction from images to overcome
778
+ | the inherent limitations of image data [FCSS09][VAB10]. While the traditional stereo
779
+ | method only reconstructs 3-D locations of image feature points, the Manhattan-world
780
+ | assumption successfully fills an area between the sparse feature points during post-
781
+ | processing. Our system, based on the Manhattan-world assumption, differentiates
782
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 14
783
+ blank |
784
+ |
785
+ |
786
+ text | between architectural features and miscellaneous objects in the space, producing a
787
+ | clean architectural floor plan and simplifying the representation of the environment.
788
+ | Even with the Manhattan-world assumption, however, the system still cannot fully
789
+ | resolve ambiguities introduced by large furniture items and irregular features in the
790
+ | space without user input. The interactive capability offered by our system allows the
791
+ | user to easily disambiguate the situation and integrate new input into a global map
792
+ | of the space in real time.
793
+ | Not only does our system simplify the representation of the feature of a space, but
794
+ | by doing so it reduces the computational burden of processing a map. Employing the
795
+ | Manhattan-world assumption simplifies the map construction to a one-dimensional,
796
+ | closed-form problem. Registration of successive point clouds results in an accumula-
797
+ | tion of errors, especially for a large environment, and requires a global optimization
798
+ | step in order to build a consistent map. This is similar to reconstruction tasks en-
799
+ | countered in robotic mapping. In other approaches, the problem is usually solved by
800
+ | bundle adjustment, a costly off-line process [TMHF00][Thr02].
801
+ | The augmented reality component of our system is inspired by the SixthSense
802
+ | project [MM09]. Instead of simply augmenting a user’s view of the world, however,
803
+ | our projected output serves to guide an interactive reconstruction process. Directing
804
+ | the user in this way is similar to re-photography [BAD10], where a user is guided
805
+ | to capture a photograph from the same viewpoint as in a previous photograph. By
806
+ | using a micro-projector as the output modality, our system allows the operator to
807
+ | focus on interacting with the environment.
808
+ blank |
809
+ |
810
+ title | 2.2 System Overview and Usage
811
+ text | The data acquisition process is initiated by the user pointing the sensor to a corner,
812
+ | where three mutually orthogonal planes meet. This corner defines the Manhattan-
813
+ | world coordinate system. The attached projector indicates successful initialization by
814
+ | overlaying blue-colored planes with white edges onto the scene (Figure 2.2 (a)). After
815
+ | the initialization, the user scans each room individually as he or she loops around in
816
+ | it holding the device. If the movement is too fast or if there are not enough features,
817
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 15
818
+ blank |
819
+ |
820
+ |
821
+ |
822
+ text | [Figure 2.2 diagram labels: Fetch a new frame; Initialization; Pair-wise registration; Plane extraction; Global adjustment (exists/new, success/failure); Map update; User interaction (visual feedback, adjust data path; left click: select planes; right click: start a new room)]
840
+ blank |
841
+ |
842
+ |
843
+ |
844
+ text | (a) (b) (c)
845
+ blank |
846
+ |
847
+ text | Figure 2.2: System overview and usage. When an acquisition session is initiated by
848
+ | observing a corner, the user is notified by a blue projection (a). After the initial-
849
+ | ization, the system updates the camera pose by registering consecutive frames. If a
850
+ | registration failure occurs, the user is notified by a red projection and is required to
851
+ | adjust the data capture path (b). Otherwise, the updated camera configuration is
852
+ | used to detect planes that satisfy the Manhattan-world assumption in the environ-
853
+ | ment and to integrate them into the global map. The user interacts with the system
854
+ | by selecting planes in the space (c). When the acquisition session is completed, the
855
+ | acquired map is used to construct a floor plan consisting of user-selected planes.
856
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 16
857
+ blank |
858
+ |
859
+ |
860
+ text | a red projection on the surface guides the user to recover the position of the device
861
+ | (Figure 2.2 (b)) and re-acquire that area.
862
+ | The system extracts flat surfaces that align with the Manhattan coordinate system
863
+ | and creates complete rectilinear polygons, even when connectivity between planes is
864
+ | occluded. At times, the user might not want some of the extracted planes (parts
865
+ | of furniture or open doors) to be included in the model even if these planes satisfy
866
+ | the Manhattan-world assumption. In these cases, when the user clicks the input
867
+ | button (left click), the extracted wall toggles between inclusion (indicated in blue)
868
+ | and exclusion (indicated in grey) to the model (Figure 2.2 (c)). As the user finishes
869
+ | scanning a room, he or she can move to another room and scan it. A new rectilinear
870
+ | polygon is initiated by a right click. Another rectilinear polygon is similarly created
871
+ | by including the selected planes, and the room is correctly positioned into the global
872
+ | coordinate system. The model is updated in real time and stored in either a CAD
873
+ | format or a 3-D mesh format that can be loaded into most 3-D modeling software.
874
+ blank |
875
+ |
876
+ title | 2.3 Data Acquisition Process
877
+ text | Some notations used throughout the section are introduced in Figure 2.3. At each
878
+ | time step t, the sensor produces a new frame of data, Ft = {Xt , It }, composed
879
+ | of a range image Xt (a 2-D array of depth measurements) and a color image It ,
880
+ | Figure 2.3(a). T t represents the transformation from the frame Ft , measured from
881
+ | the current sensor position, to the global coordinate system, which is where the map
882
+ | Mt = {Ltr , Rtr } is defined, Figure 2.3(b). Throughout the data capture session, the
883
+ | system maintains the global map Mt , and the two most recent frames, Ft−1 and Ft
884
+ | to update the transformation information. Instead of storing information from all
885
+ | frames, the system keeps the total computational and memory requirements minimal
886
+ | by incrementally updating the global map only with components that need to be
887
+ | added to the final model. Additionally, the frame with the last observed corner Fc is
888
+ | stored to recover the sensor position when lost.
889
+ | After the transformation is found, the relationship between the planes in global
890
+ | map Mt and the measurement in the current frame Xt is represented as Pt , a 2-D
891
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 17
892
+ blank |
893
+ |
894
+ |
895
+ |
896
+ text | [Figure 2.3 diagram: (a) frame F^t with range image X^t, color image I^t, and
+ | transformation T^t(F^t); (b) observed planes L_r^t; (c) plane labels P^t;
+ | (d) rectilinear polygon R_r^t; individual planes carry labels such as P0, P2-P8.]
917
+ blank |
918
+ text | Figure 2.3: Notation and representation. Each frame of the sensor Ft is composed of
919
+ | a 2-D array of depth measurements Xt and color image It (a). The global map Mt
920
+ | is composed of a sequence of observed planes Ltr (b) and loops of rectilinear polygons
921
+ | built from the planes Rtr (d). After the registration of the current frame T t is found
922
+ | with respect to the global coordinate system, planes Pt are extracted (c), and the system
923
+ | automatically updates the room structure based on the observation Rtr (d).
924
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 18
925
+ blank |
926
+ |
927
+ |
928
+ text | array of plane labels for each pixel, Figure 2.3(c). The map Mt is composed of lists of
929
+ | observed axis-parallel planes Ltr and loops of current room structure Rtr , defined with
930
+ | subsets of the planes from Ltr . Each plane has its axis label (x, y, or z) and the offset
931
+ | value (e.g., x = x0 ), as well as its left or right plane if the connectivity is observed. A
932
+ | plane can be selected (shown as solid line in Figure 2.3(b)) or ignored (dotted line in
933
+ | Figure 2.3(b)) based on user input. The selected planes are extracted from Ltr as the
934
+ | loop of the room Rtr , which can be converted into the floor plan as a 2-D rectilinear
935
+ | polygon. To have a fully connected rectilinear polygon per room, Rtr is constrained
936
+ | to have alternating axis labels (x and y). For the z direction (vertical direction), the
937
+ | system retains only the ceiling and the floor. The system also keeps the sequence of
938
+ | observation (S x , S y , and S z ) of offset values for each axis direction, and stores the
939
+ | measured distance and the uncertainty of the measurement between planes.
940
+ | The overall reconstruction process is summarized in Figure 2.2. As mentioned in
941
+ | Sec. 2.2, this process is initiated by extracting three mutually orthogonal planes when
942
+ | a user points the system to one of the corners of a room. To detect planes in the range
943
+ | data, our system fits plane equations to groups of range points and their corresponding
944
+ | normals using the RANSAC algorithm [FB81]: the system first randomly samples a
945
+ | few points, then fits a plane equation to them. The system then tests the detected
946
+ | plane by counting the number of points that can be explained by the plane equation.
947
+ | After convergence, the detected plane is classified as valid only if the detected points
948
+ | constitute a large, connected portion of the depth information within the frame. If
949
+ | there are three planes detected, and they are orthogonal to each other, our system
950
+ | assigns the x, y and z axes to be the normal directions of these three planes, which
951
+ | form the right-handed coordinate system for our Manhattan world. Now the map Mt
952
+ | has two planes (the floor or ceiling is excluded), and the transformation T t between
953
+ | Mt and Ft is also found.
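+ text | As a rough illustration of this initialization step, the sketch below fits a dominant
+ | plane with RANSAC and checks that three detected normals are mutually orthogonal. It is
+ | a minimal sketch only: the function names, iteration counts, and tolerances are
+ | illustrative and are not taken from the thesis implementation.
+ blank |
+ | import numpy as np
+ |
+ | def fit_plane(points):
+ |     """Least-squares plane through an (N, 3) array: returns (unit normal n, offset d) with n.x = d."""
+ |     centroid = points.mean(axis=0)
+ |     # The singular vector of the smallest singular value of the centered points is the normal.
+ |     _, _, vt = np.linalg.svd(points - centroid)
+ |     normal = vt[-1]
+ |     return normal, float(normal @ centroid)
+ |
+ | def ransac_plane(points, iters=200, tol=0.02):
+ |     """Detect the dominant plane by repeatedly fitting 3-point samples (tol in meters)."""
+ |     best = None
+ |     for _ in range(iters):
+ |         sample = points[np.random.choice(len(points), 3, replace=False)]
+ |         normal, d = fit_plane(sample)
+ |         inliers = np.abs(points @ normal - d) < tol
+ |         if best is None or inliers.sum() > best.sum():
+ |             best = inliers
+ |     return fit_plane(points[best]), best
+ |
+ | def mutually_orthogonal(normals, tol_deg=5.0):
+ |     """True if three plane normals are pairwise orthogonal within tol_deg degrees."""
+ |     tol = np.sin(np.radians(tol_deg))
+ |     return all(abs(normals[i] @ normals[j]) < tol
+ |                for i, j in [(0, 1), (0, 2), (1, 2)])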
954
+ | A new measurement Ft is registered with the previous frame Ft−1 by aligning
955
+ | depth and color features (Sec. 2.3.1). This registration is used to update T t−1 to a
956
+ | new transformation T t . The system extracts planes that satisfy the Manhattan-world
957
+ | assumption from T t (Ft ) (Sec. 2.3.2). If the extracted planes already exist in Ltr , the
958
+ | current measurement is compared with the global map and the registration is refined
959
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 19
960
+ blank |
961
+ |
962
+ |
963
+ |
964
+ text | (a) (b) (c) (d)
965
+ blank |
966
+ text | Figure 2.4: (a) Flat wall features (depicted by the triangle and circle) are observed
967
+ | from two different locations. Diagram (b) shows both observations with respect to
968
+ | the camera coordinate system. Without features, using projection-based ICP can
969
+ | lead to registration errors in the image-plane direction (c), while the use of features
970
+ | will provide better registration (d).
971
+ blank |
972
+ text | (Sec. 2.3.3). If there is a new plane extracted, or if there is user input to specify the
973
+ | map structure, the map is updated accordingly (Sec. 2.3.4).
974
+ blank |
975
+ |
976
+ title | 2.3.1 Pair-Wise Registration
977
+ text | To propagate information from previous frames and to detect new planes in the scene,
978
+ | each incoming frame must be registered with respect to the global coordinate system.
979
+ | To start this process, the system finds the relative registration between the two most
980
+ | recent frames, Ft−1 and Ft . By using both the depth point clouds (Xt−1 , Xt ) and
981
+ | optical images (It−1 , It ), the system can efficiently register frames in real time (about
982
+ | 15 fps).
983
+ | Given two sets of point clouds, X^{t-1} = \{x_i^{t-1}\}_{i=1}^N and X^t = \{x_i^t\}_{i=1}^N, and the
985
+ | transformation for the previous point cloud T t−1 , the correct rigid transformation T t
986
+ | will minimize the error between correspondences in the two sets:
987
+ blank |
988
+ text | \min_{y_i^t,\, T^t} \; \sum_i \big\| w_i \big( T^{t-1}(x_i^{t-1}) - T^t(y_i^t) \big) \big\|^2    (2.1)
992
+ blank |
993
+ text | y_i^t ∈ X^t is the corresponding point for x_i^{t-1} ∈ X^{t-1}. Once the correspondence is
995
+ | known, minimizing Eq. (2.1) becomes a closed-form solution [BM92]. In conventional
996
+ | approaches, correspondence is found by searching for the closest point, which is com-
997
+ | putationally expensive. Real-time registration methods reduce the cost by projecting
998
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 20
999
+ blank |
1000
+ |
1001
+ |
1002
+ |
1003
+ text | (a) i^{t-1} ∈ I^{t-1}   (b) j^t ∈ I^t   (c) H^t(I^{t-1})   (d) |I^t − H^t(I^{t-1})|
1004
+ blank |
1005
+ text | Figure 2.5: From optical flow between two consecutive frames, sparse image features
1006
+ | are matched between (a) it−1 ∈ It−1 and (b) j t ∈ It . The matched features are then
1007
+ | used to calculate homography Ht such that the previous image It−1 can be warped to
1008
+ | the space of the current image It and create dense projective correspondences (c). The
1009
+ | difference image (d) shows that most of the dense correspondences are within a few-pixel
1010
+ | error in the image plane, with a slight offset around silhouette areas.
1011
+ blank |
1012
+ text | the 3-D points onto a 2-D image plane and assigning correspondences to points that
1013
+ | project onto the same pixel locations [RL01]. However, projection will only reduce the
1014
+ | distance in the ray direction; the offset parallel to the image plane cannot be adjusted.
1015
+ | This phenomenon can result in the algorithm not compensating for the translation
1016
+ | parallel to the plane and therefore shrinking the size of the room (Figure 2.4).
1017
+ | Our pair-wise registration is similar to [RL01], but it compensates for the dis-
1018
+ | placement parallel to the image plane using image features and silhouette points.
1019
+ | Intuitively, the system uses homography to compensate for errors parallel to the
1020
+ | plane if the structure can be approximated into a plane, and silhouette points are
1021
+ | used to compensate for remaining errors when the features are not planar.
1022
+ | Our system first computes the optical flow between color images It and It−1 and
1023
+ | finds a sparse set of features matched between them, Figure 2.5(a)(b). The sparse set
1024
+ | of features then can be used to create dense projective correspondence between the
1025
+ | two frames, Figure 2.5(c)(d). More specifically, homography is a transform between
1026
+ | 2-D homogeneous coordinates defined by a matrix H ∈ R3×3 :
1027
+ blank |
1028
+ text | \min_H \sum_{i^{t-1}, j^t} \| H\, i^{t-1} - j^t \|^2, \quad \text{where } i^{t-1} = (u_i, v_i, 1)^T \in I^{t-1}, \; j^t = (w u_j, w v_j, w)^T \in I^t    (2.2)
1038
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 21
1039
+ blank |
1040
+ |
1041
+ |
1042
+ |
1043
+ text | Figure 2.6: Silhouette points. There are two different types of depth discontinuity:
1044
+ | the boundaries of a shadow made on the background by a foreground object (empty
1045
+ | circles), and the boundaries of a foreground object (filled circles). The meaningful
1046
+ | depth features are the foreground points, which are the silhouette points used for our
1047
+ | registration pipeline.
1048
+ blank |
1049
+ text | Compared to naive projective correspondence used in [RL01], a homography de-
1050
+ | fines a map between two planar surfaces in 3-D space. The homography represents
1051
+ | the displacement parallel to the image plane, and is used to compute dense corre-
1052
+ | spondences between the two frames. While a homography does not represent a full
1053
+ | transformation in 3-D, the planar approximation works well in practice for our sce-
1054
+ | nario, where the scene is mostly composed of flat planes and the relative movement is
1055
+ | small. From the second iteration, the correspondence is found by projecting individual
1056
+ | points onto the image plane, as shown in [RL01].
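+ text | One possible realization of this warping step is sketched below. It assumes OpenCV is
+ | available (the thesis does not name a library) and that the optical-flow step has
+ | already produced matched pixel locations pts_prev and pts_cur; the function and
+ | variable names are illustrative.
+ blank |
+ | import numpy as np
+ | import cv2
+ |
+ | def dense_correspondence_from_homography(pts_prev, pts_cur, image_prev):
+ |     """Estimate H from sparse matches (Eq. 2.2), warp the previous image, and map
+ |     every previous-frame pixel into the current frame (cf. Figure 2.5)."""
+ |     # pts_prev, pts_cur: (N, 2) float32 arrays of matched pixel coordinates.
+ |     H, _ = cv2.findHomography(pts_prev, pts_cur, cv2.RANSAC, 3.0)
+ |     h, w = image_prev.shape[:2]
+ |     warped_prev = cv2.warpPerspective(image_prev, H, (w, h))
+ |     # Dense projective correspondence: push the full pixel grid through H.
+ |     u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
+ |     grid = np.stack([u.ravel(), v.ravel()], axis=1).reshape(-1, 1, 2)
+ |     mapped = cv2.perspectiveTransform(grid, H).reshape(h, w, 2)
+ |     return H, warped_prev, mapped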
1057
+ | Given the correspondence, the registration between the frames for the current iter-
1058
+ | ation can be given as a closed-form solution (Equation 2.1). Additionally, the system
1059
+ | modifies the correspondence for silhouette points (points of depth discontinuity in
1060
+ | the foreground, shown in Figure 2.6). For silhouette points in Xt−1 , the system finds
1061
+ | the closest silhouette points in Xt within a small search window from the original
1062
+ | corresponding location. If the matching silhouette point exists, the correspondence is
1063
+ | weighted more. (We used wi = 100 for silhouette points and wi = 1 for non-silhouette
1064
+ | points.) The process iterates until it converges.
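+ text | The weighted closed-form update behind Eq. (2.1) can be sketched as a Kabsch-style
+ | solve, shown below with the silhouette weighting mentioned in the text (100 versus 1).
+ | The correspondence arrays are assumed to come from the projective and silhouette
+ | matching described above; this is an illustrative sketch, not the thesis code.
+ blank |
+ | import numpy as np
+ |
+ | def weighted_rigid_transform(src, dst, weights):
+ |     """Closed-form (R, t) minimizing sum_i w_i ||R src_i + t - dst_i||^2 (Kabsch-style)."""
+ |     w = weights / weights.sum()
+ |     mu_src = (w[:, None] * src).sum(axis=0)
+ |     mu_dst = (w[:, None] * dst).sum(axis=0)
+ |     cov = (w[:, None] * (src - mu_src)).T @ (dst - mu_dst)
+ |     u, _, vt = np.linalg.svd(cov)
+ |     d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
+ |     R = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
+ |     t = mu_dst - R @ mu_src
+ |     return R, t
+ |
+ | def registration_step(points_prev, points_cur, corr, is_silhouette):
+ |     """One alignment iteration; silhouette correspondences get 100x the weight."""
+ |     weights = np.where(is_silhouette, 100.0, 1.0)
+ |     return weighted_rigid_transform(points_prev, points_cur[corr], weights)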
1065
+ blank |
1066
+ title | Registration Failure
1067
+ blank |
1068
+ text | The real-time registration is a crucial part of our algorithm for accurate reconstruc-
1069
+ | tion. Even with the hybrid approach in which both color and depth features are used,
1070
+ | the registration can fail, and it is important to detect the failure immediately and
1071
+ | to recover the position of the sensor. The registration failure is detected either (1)
1072
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 22
1073
+ blank |
1074
+ |
1075
+ |
1076
+ text | if the pair-wise registration does not converge or (2) if there are not enough color
1077
+ | and depth features. The first case can be easily detected as the algorithm runs. The
1078
+ | second case is detected if the optical flow does not find a homography (i.e., there is a
1079
+ | lack of color features) and there are not enough matched silhouette points (i.e., there
1080
+ | is a lack of depth features).
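+ text | The failure test itself is simple enough to state directly; the sketch below mirrors
+ | the two conditions. The minimum number of silhouette matches is an illustrative
+ | threshold, not a value taken from the thesis.
+ blank |
+ | def registration_failed(converged, homography_found, n_silhouette_matches,
+ |                         min_silhouette_matches=50):
+ |     """(1) non-convergence, or (2) no homography (lack of color features) together
+ |     with too few silhouette matches (lack of depth features)."""
+ |     lacks_color = not homography_found
+ |     lacks_depth = n_silhouette_matches < min_silhouette_matches
+ |     return (not converged) or (lacks_color and lacks_depth)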
1081
+ | In cases of registration failure, the projected image turns red, indicating that the
1082
+ | user should return the system’s viewpoint to the most recently observed corner. This
1083
+ | movement usually takes only a small amount of back-tracking because the failure
1084
+ | is detected within milliseconds of leaving the previous successfully registered area.
1085
+ | Similar to the initialization step, the system extracts planes from Xt using RANSAC
1086
+ | and matches the planes with the desired corner. Figure 2.2 (b) depicts the process of
1087
+ | overcoming a registration failure. The user then deliberately moves the sensor along
1088
+ | the path with richer features or steps farther from a wall to cover a wider view.
1089
+ blank |
1090
+ |
1091
+ title | 2.3.2 Plane Extraction
1092
+ text | Based on the transformation T t , the system extracts axis-aligned planes and asso-
1093
+ | ciated edges. The planes and detected features will provide higher-level information
1094
+ | that relates the raw point cloud Xt to the global map Mt . Because the system only
1095
+ | considers planes aligned with the Manhattan-world coordinate system, we were
1096
+ | able to simplify the plane detection procedure.
1097
+ | The planes from the previous frame that remain visible can be easily found by
1098
+ | using the correspondence. From the pair-wise registration (Sec. 2.3.1), our system
1099
+ | has the point-wise correspondence between the previous frame and the current frame.
1100
+ | The plane label Pt−1 from the previous frame is updated simply by being copied over
1101
+ | to the corresponding location. Then, the system refines Pt by alternating between
1102
+ | fitting points and fitting parameters.
1103
+ | A new plane can be found by projecting remaining points for the x, y, and z axes.
1104
+ | For each axis direction, a histogram is built with a bin size of 20 cm. The system then
1105
+ | tests the plane equation for populated bins. Compared to the RANSAC procedure
1106
+ | for initialization, the Manhattan-world assumption reduces the number of degrees of
1107
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 23
1108
+ blank |
1109
+ |
1110
+ |
1111
+ text | freedom from three to one, making plane extraction more efficient.
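+ text | A minimal sketch of this one-dimensional search is shown below: for each Manhattan
+ | axis the still-unlabeled points are histogrammed with 20 cm bins and a plane equation
+ | is tested at every populated bin. The minimum point count and the inlier tolerance
+ | are illustrative values, not the thesis settings.
+ blank |
+ | import numpy as np
+ |
+ | def detect_axis_aligned_planes(points, labels, bin_size=0.2, min_points=500, tol=0.03):
+ |     """points: (N, 3) array in the Manhattan frame; labels < 0 marks unassigned points.
+ |     Returns candidate planes as (axis, offset, support)."""
+ |     remaining = points[labels < 0]
+ |     candidates = []
+ |     if len(remaining) == 0:
+ |         return candidates
+ |     for axis in range(3):                        # x, y, z directions
+ |         coords = remaining[:, axis]
+ |         lo, hi = coords.min(), coords.max()
+ |         nbins = max(1, int(np.ceil((hi - lo) / bin_size)))
+ |         hist, edges = np.histogram(coords, bins=nbins, range=(lo, lo + nbins * bin_size))
+ |         for b in np.flatnonzero(hist > min_points):
+ |             offset = 0.5 * (edges[b] + edges[b + 1])    # tentative plane: axis = offset
+ |             support = int((np.abs(coords - offset) < tol).sum())
+ |             if support > min_points:
+ |                 candidates.append((axis, float(offset), support))
+ |     return candidates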
1112
+ | For extracted planes, the boundary edges are also extracted; the system detects
1113
+ | groups of boundary points that can be explained by an axis-parallel line segment.
1114
+ | The system also retains the information about relative positions for extracted planes
1115
+ | (left/right). As long as the sensor is not flipped upside-down, this information pro-
1116
+ | vides an important cue to build a room with the correct topology, even when the
1117
+ | connectivity between neighboring planes has not been observed.
1118
+ blank |
1119
+ title | Data Association
1120
+ blank |
1121
+ text | After the planes are extracted, the data association process finds the link between the
1122
+ | global map Mt and the extracted planes, represented as Pt , a 2-D array of plane labels for each
1123
+ | pixel. The system automatically finds plane labels that existed from the previous
1124
+ | frame and extracts the planes by copying over the plane labels using correspondences.
1125
+ | The plane labels for the newly detected plane can be found by comparing T t (Ft )
1126
+ | and Mt . In addition to the plane equation, the relative position of the newly observed
1127
+ | plane with respect to other observed planes is used to label the plane. If the plane
1128
+ | has not been previously observed, a new plane will be added into Ltr based on the
1129
+ | left-right information.
1130
+ | After the data association step, the system updates the sequence of observation
1131
+ | S. The planes that have been assigned as previously observed are used for global
1132
+ | adjustment (Sec. 2.3.3). If a new plane is observed, the room Rtr will be updated
1133
+ | accordingly (Sec. 2.3.4).
1134
+ blank |
1135
+ |
1136
+ title | 2.3.3 Global Adjustment
1137
+ text | Due to noise in the point cloud, frame-to-frame registration is not perfect, and er-
1138
+ | ror accumulates over time. This is a common problem in pose estimation. Large-
1139
+ | scale localization approaches use bundle adjustment to compensate for error accumula-
1140
+ | tion [TMHF00, Thr02]. Enforcing this global constraint involves detecting landmark
1141
+ | objects, or stationary objects observed at different times during a sequence of mea-
1142
+ | surements. Usually this global adjustment becomes an optimization problem in many
1143
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 24
1144
+ blank |
1145
+ |
1146
+ |
1147
+ |
1148
+ text | Figure 2.7: As errors accumulate in T t and in measurements, the map Mt becomes
1149
+ | inconsistent. By comparing previous and recent measurements, the system can correct
1150
+ | for inconsistency and update the value of c such that c = a.
1151
+ blank |
1152
+ text | dimensions. The problem is formulated by constraining the landmarks to predefined
1153
+ | global locations, and by solving an energy function that encodes noise in a pose es-
1154
+ | timation of both sensor and landmark locations. The Manhattan-world assumption
1155
+ | allows us to reduce the error accumulation efficiently in real time by refining our
1156
+ | registration estimate and by optimizing the global map.
1157
+ blank |
1158
+ title | Refining the Registration
1159
+ blank |
1160
+ text | After data association, the system performs a second round of registration with re-
1161
+ | spect to the global map Mt to reduce the error accumulation in T t by incremental,
1162
+ | pair-wise registration. The extracted planes Pt , if already observed by the system,
1163
+ | have been assigned to the planes in Mt that have associated plane equations. For
1164
+ | example, suppose a point T t (xu,v ) = (x, y, z) has a plane label Pt (u, v) = pk (assigned
1165
+ | to plane k). If plane k has normal parallel to the x axis, the plane equation in the
1166
+ | global map Mt can be written as x = x0 (x0 ∈ R). Consequently, the registration
1167
+ | should be refined to minimize kx − x0 k2 . In other words, the refined registration can
1168
+ | be found by defining the corresponding point for xu,v as (x0 , y, z). The corresponding
1169
+ | points are likewise assigned for every point with a plane assignment in Pt . Given the
1170
+ | correspondence, the system can refine the registration between the current frame Ft
1171
+ | and the global map Mt . This second round of registration reduces the error in the
1172
+ | axis direction. In our example, the refinement is active while the plane x = x0 is
1173
+ | visible and reduces the uncertainty in the x direction with respect to the global map.
1174
+ | The error in the x direction is not accumulated during the interval.
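+ text | The plane-constrained correspondences of this refinement can be sketched as follows:
+ | each point assigned to a known plane is paired with a copy of itself whose coordinate
+ | along the plane's axis is snapped to the plane offset. The array layout (per-plane axis
+ | and offset arrays) is an assumption made for the sketch, and the resulting pairs would
+ | then feed a closed-form alignment like the one sketched earlier.
+ blank |
+ | import numpy as np
+ |
+ | def plane_constrained_targets(points_world, plane_labels, plane_axis, plane_offset):
+ |     """points_world: (N, 3) points already mapped by T^t; plane_labels: (N,) index into
+ |     the global planes or -1; plane_axis[k] in {0, 1, 2} and plane_offset[k] give the
+ |     equation of plane k (e.g. x = x0).  Returns snapped targets and a validity mask."""
+ |     targets = points_world.copy()
+ |     valid = plane_labels >= 0
+ |     rows = np.flatnonzero(valid)
+ |     axes = plane_axis[plane_labels[rows]]
+ |     targets[rows, axes] = plane_offset[plane_labels[rows]]
+ |     return targets, valid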
1175
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 25
1176
+ blank |
1177
+ |
1178
+ |
1179
+ title | Optimizing the Map
1180
+ blank |
1181
+ text | As error accumulates, the reconstructed map Mt may also require global adjust-
1182
+ | ment in each axis direction. The Manhattan-world assumption simplifies this global
1183
+ | optimization into two separate, one-dimensional problems (we are excluding the z
1184
+ | direction for now, but the idea can be extended to a 3-D case).
1185
+ | Figure 2.7 shows a simple example in the x-axis direction. Let us assume that
1186
+ | the figure represents an overhead view of a rectangular room. There should be two
1187
+ | walls whose normals are parallel to the x-axis. The sensor detects the first wall
1188
+ | (x = a), sweeps around the room, observes another wall (x = b), and returns to
1189
+ | the previously observed wall. Because of error accumulation, parts of the same wall
1190
+ | have two different offset values (x = a and x = c), but by observing the left-right
1191
+ | relationship between walls, the system infers that the two walls are indeed the same
1192
+ | wall.
1193
+ | To optimize the offset values, the system tracks the sequence of observations
1194
+ | S x = {a, b, c} and the variances at the point of observation for each wall, as well as the
1195
+ | constraints represented by the pair of the same offset values C x = {(c11 , c12 ) = (a, c)}.
1196
+ | We introduce two random variables, ∆1 and ∆2 , to constrain the global map op-
1197
+ | timization. ∆1 is a random variable with mean m1 = b − a and variance σ12 that
1198
+ | represents the error between the moment when the sensor observes the x = a wall
1199
+ | and the moment it observes the x = b wall. Likewise, a random variable ∆2 represents
1200
+ | the error with mean m2 = c − b and variance σ22 .
1201
+ | Whenever a new constraint is added, or when the system observes a plane that
1202
+ | was previously observed, the global adjustment routine is triggered. This is usually
1203
+ | when the user finishes scanning a room by looping around it and returning to the
1204
+ | first wall measured. By confining the axis direction, the global adjustment becomes
1205
+ | a one-dimensional quadratic equation:
1206
+ blank |  
1207
+ text | \min_{S^x} \sum_i \frac{\| \Delta_i - m_i \|^2}{\sigma_i^2} \quad \text{s.t.} \quad c_{j1} = c_{j2}, \; \forall (c_{j1}, c_{j2}) \in C^x    (2.3)
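+ text | Under this one-dimensional formulation, the adjustment can be sketched as an
+ | equality-constrained least-squares solve over the observed offsets. The KKT
+ | construction below is only one way to do this; it pins the first offset to remove the
+ | translation ambiguity, and the data layout is an assumption of the sketch.
+ blank |
+ | import numpy as np
+ |
+ | def adjust_offsets(offsets, gap_variances, constraints):
+ |     """offsets: observed S^x, e.g. [a, b, c]; gap_variances: sigma_i^2 for each successive
+ |     gap Delta_i; constraints: index pairs that must coincide, e.g. [(0, 2)] when the
+ |     first and last observation are the same wall."""
+ |     s = np.asarray(offsets, dtype=float)
+ |     n = len(s)
+ |     m = np.diff(s)                                   # measured gap means m_i
+ |     w = 1.0 / np.asarray(gap_variances, dtype=float)
+ |     A = np.zeros((n - 1, n))                         # (A s)_i = s_{i+1} - s_i
+ |     A[np.arange(n - 1), np.arange(n - 1)] = -1.0
+ |     A[np.arange(n - 1), np.arange(1, n)] = 1.0
+ |     C = np.zeros((len(constraints) + 1, n))          # equality constraints + gauge
+ |     for row, (j1, j2) in enumerate(constraints):
+ |         C[row, j1], C[row, j2] = 1.0, -1.0
+ |     C[-1, 0] = 1.0                                   # pin s_0 to its observed value
+ |     rhs_c = np.zeros(len(constraints) + 1)
+ |     rhs_c[-1] = s[0]
+ |     H = A.T @ (w[:, None] * A)                       # weighted normal-equations block
+ |     K = np.block([[H, C.T], [C, np.zeros((C.shape[0], C.shape[0]))]])
+ |     rhs = np.concatenate([A.T @ (w * m), rhs_c])
+ |     return np.linalg.solve(K, rhs)[:n]
+ |
+ | # Figure 2.7 example (illustrative numbers): walls observed at a=0.0, b=3.0, c=0.2,
+ | # with a and c the same wall:
+ | # adjust_offsets([0.0, 3.0, 0.2], gap_variances=[0.01, 0.04], constraints=[(0, 2)])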
1212
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 26
1213
+ blank |
1214
+ |
1215
+ |
1216
+ |
1217
+ text | Figure 2.8: Selection. In sequence (a), the user is observing two new planes in the
1218
+ | scene (colored white) and one currently included plane (colored blue). The user selects
1219
+ | one of the new planes by pointing at it and clicking. Then, the second new plane is
1220
+ | added. All planes are blue in the final frame, confirming that all planes have been
1221
+ | successfully selected. Sequence (b) shows a configuration where the user has decided
1222
+ | not to include the large cabinet. Sequence (c) shows successful selection of the ceiling
1223
+ | and the wall despite clutter.
1224
+ blank |
1225
+ title | 2.3.4 Map Update
1226
+ text | Our algorithm ignores most irrelevant features by using the Manhattan-world as-
1227
+ | sumption. However, the system cannot distinguish architectural components from
1228
+ | other axis-aligned objects using the Manhattan-world assumption. For example, fur-
1229
+ | niture, open doors, parts of other rooms that might be visible, or reflections from
1230
+ | mirrors may be detected as axis-aligned planes. The system solves the challenging
1231
+ | cases by allowing the user to manually specify the planes that he or she would like to
1232
+ | include in the final model. This manual specification consists of simply clicking the
1233
+ | input button during scanning when pointing at a plane, as shown in Figure 2.8. If
1234
+ | the user enters a new room, a right click of the button indicates that the user wishes
1235
+ | to include this new room and to optimize it individually. The system creates a new
1236
+ | loop of planes, and any newly observed planes are added to the loop.
1237
+ | Whenever a new plane is added to Ltr or there is user input to specify the room
1238
+ | structure, the map update routine extracts a 2-D rectilinear polygon Rtr from Ltr with
1239
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 27
1240
+ blank |
1241
+ |
1242
+ |
1243
+ text | [Figure 2.9 pie chart: per-step timings in ms (legend: data i/o, prepare image,
+ | pre-processing, optical flow, pair-wise registration, plane extraction, data
+ | association, refine registration, optimize map); labeled slices include 0.104, 3.318,
+ | 5.797, 6.728, 11.845, 13.203, 14.517, and 58.672 ms, with the largest slice (51%)
+ | corresponding to pair-wise registration.]
1252
+ blank |
1253
+ text | Figure 2.9: The average computational time for each step of the system.
1254
+ blank |
1255
+ text | the help of user input. A valid rectilinear polygon structure should have alternating
1256
+ | axis directions for any pair of adjacent walls (an x = xi wall should be connected to
1257
+ | a y = yj wall). The system starts by adding all selected planes into Rtr as well as
1258
+ | whichever unselected planes in Ltr are necessary to have alternating axis directions.
1259
+ | When planes are added, the planes with observed boundary edges are preferred. If
1260
+ | the two observed walls have the same axis direction, the unobserved wall is added
1261
+ | between them on the boundary of the planes to form a complete loop.
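+ text | A much-simplified sketch of the alternation rule is given below; it only inserts
+ | placeholder connector walls when two consecutive selected walls share an axis and
+ | ignores the offset geometry and the preference for observed boundary edges, so it should
+ | be read as an illustration of the constraint rather than as the actual routine.
+ blank |
+ | def enforce_alternation(selected_walls):
+ |     """selected_walls: ordered list of (axis, offset) with axis in {'x', 'y'}.
+ |     Returns a loop where adjacent walls always have different axes."""
+ |     loop = []
+ |     for wall in selected_walls:
+ |         if loop and loop[-1][0] == wall[0]:
+ |             other = 'y' if wall[0] == 'x' else 'x'
+ |             loop.append((other, None))            # unobserved connecting wall
+ |         loop.append(wall)
+ |     # Close the loop: the first and last walls must also alternate.
+ |     if len(loop) >= 2 and loop[0][0] == loop[-1][0]:
+ |         loop.append(('y' if loop[0][0] == 'x' else 'x', None))
+ |     return loop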
1262
+ blank |
1263
+ |
1264
+ title | 2.4 Evaluation
1265
+ text | The goal of the system is to build a floor plan of any interior environment.
1266
+ | In our testing of the system, we mapped the apartments of six different volunteers,
1267
+ | ranging from approximately 500 to 2000 ft2 and located in Palo Alto. The residents were living
1268
+ | in the scanned places and thus the apartments exhibited different amounts and types
1269
+ | of objects.
1270
+ | For each data set, we compare the floor plan generated by our system with one
1271
+ | manually-generated using measurements from a commercially available measuring
1272
+ | device.1 The current practice in architecture and real estate is to use a point-to-
1273
+ | point laser device to measure distances between pairs of parallel planes. Making
1274
+ | such measurements requires a clear, level line of sight between two planes, which
1275
+ meta | 1
1276
+ text | measuring range 0.05 to 40m; average measurement accuracy +/- 1.5mm; measurement duration
1277
+ | < 0.5s to 4s per measurement.
1278
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 28
1279
+ blank |
1280
+ |
1281
+ |
1282
+ text | may be time-consuming to find due to the presence of furniture, windows, and other
1283
+ | obstructions. Moreover, after making all the distance measurements, a user is required
1284
+ | to manually draw a floor plan that respects the measurements. In our tests, roughly
1285
+ | 10-20 minutes were needed to build a floor plan of each apartment in the conventional
1286
+ | way as described.
1287
+ | Using our system, the data acquisition process took approximately 2-5 minutes per
1288
+ | apartment to initiate, run, and generate the full floor plan. Table 2.1 summarizes the
1289
+ | timing data for each data set. The average frame rate is 7.5 frames per second running
1290
+ | on an Intel 2.50GHz Dual Core laptop. Figure 2.9 depicts the average computational
1291
+ | time for each step of the algorithm. The pair-wise registration routine (Sec. 2.3.1)
1292
+ | contributes more than half of the computational time, followed by the pre-processing
1293
+ | step of fetching a new frame and calculating optical flow (25%).
1294
+ | In Figure 2.10, we visually compare the floor plans reconstructed in a conventional
1295
+ | way with those built by our system. The floor plans in blue were reconstructed using
1296
+ | point-to-point laser measurements, and the floor plans in red were reconstructed by
1297
+ | our system. For each apartment, the topology of the reconstructed walls agrees with
1298
+ | the manually-constructed floor plan. In all cases the detection and labeling of planar
1299
+ | surfaces by our algorithm enabled the user to add or remove these surfaces from
1300
+ | the model in real time, allowing the final model to be constructed using only the
1301
+ | important architectural elements from the scene.
1302
+ | The overlaid floor plans in Figure 2.10(c) show that the relative placement of
1303
+ | the rooms may be misaligned. This is because our global adjustment routine optimizes
1304
+ | rooms individually, thus errors can accumulate in transitions between rooms. The
1305
+ | algorithm could be extended to enforce global constraints on the relative placement
1306
+ | of rooms, such as maintaining a certain wall thickness and/or aligning the outer-most
1307
+ | walls, but such global constraints may induce other errors.
1308
+ | Table 2.1 contains a quantitative comparison of the errors. The reported depth
1309
+ | resolution of the sensor is 0.01m at 2m, and for each model we have an average of
1310
+ | 0.075m error per wall. The relative error stays in the range of 2-5%, which shows
1311
+ | that small registration errors continue to accumulate as more
1312
+ | frames are processed.
1313
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 29
1314
+ blank |
1315
+ |
1316
+ text | data set   no. of frames   run time   fps    avg. error (m)   avg. error (%)
+ | 1          1465            2m 56s     8.32   0.115            4.14
+ | 2          1009            1m 57s     8.66   0.064            1.90
+ | 3          2830            5m 19s     8.88   0.053            2.40
+ | 4          1129            2m 39s     7.08   0.088            2.34
+ | 5          1533            3m 52s     6.59   0.178            3.52
+ | 6          2811            7m 4s      6.65   0.096            3.10
+ | ave.       1795            3m 57s     7.54   0.075            2.86
1326
+ blank |
1327
+ text | Table 2.1: Accuracy comparison between floor plans reconstructed by our system, and
1328
+ | manually constructed floor plans generated from point-to-point laser measurements.
1329
+ blank |
1330
+ text | Fundamentally, the limitations of our method reflect the limitations of the Kinect
1331
+ | sensor, the processing power of the laptop, and the assumptions made in our
1332
+ | approach. Because the accuracy of real-time depth data is worse than that from
1333
+ | visual features, our approach exhibits larger errors compared to visual SLAM (e.g.,
1334
+ | [ND10]). Some of the uncertainty can be reduced by adapting approaches from the
1335
+ | well-explored visual SLAM literature. Still, the system is limited when meaningful
1336
+ | features cannot be detected. The Kinect sensor’s reported measurement range is
1337
+ | between 1.2 and 3.5m from an object; outside that range, data is noisy or unavailable.
1338
+ | As a consequence, data in narrow hallways or large atriums was difficult to collect.
1339
+ | Another source of potential error is a user outpacing the operating rate of approx-
1340
+ | imately 7.5 fps. This frame rate already allows for a reasonable data capture pace,
1341
+ | but with more processing power, the pace of the system could always be guaranteed
1342
+ | to exceed normal human motion.
1343
+ blank |
1344
+ |
1345
+ title | 2.5 Conclusions and Future Work
1346
+ text | We have presented an interactive system that allows a user to capture accurate ar-
1347
+ | chitectural information and to automatically generate a floor plan. Leveraging the
1348
+ | Manhattan-world assumption, we have created a representation that is tractable in
1349
+ | real time while ignoring clutter. In the presented system, the current status of the
1350
+ | reconstruction is projected on the scanned environment to enable the user to provide
1351
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 30
1352
+ blank |
1353
+ |
1354
+ |
1355
+ text | high-level feedback to the system. This feedback helps overcome ambiguous situa-
1356
+ | tions and allows the user to interactively specify the important planes that should be
1357
+ | included in the model.
1358
+ | If there are not enough features scanned for the system to determine that the
1359
+ | operator has moved, the system will assume that motion has not occurred, leading to
1360
+ | a general underestimation of wall lengths when no depth or image features are available.
1361
+ | These challenges could be overcome by including an IMU or other devices to assist in the
1362
+ | pose tracking of the system.
1363
+ | We have limited our Manhattan-world features to axis-aligned planes in vertical
1364
+ | directions. However, in future work, we could generalize the system to handle rec-
1365
+ | tilinear polyhedra which are not convex in the vertical direction. Furthermore, the
1366
+ | world could be expanded to include walls that are not aligned with the axes of the
1367
+ | global coordinate system.
1368
+ | More broadly, our interactive system can be extended to other applications in
1369
+ | indoor environments. For example, a user could visualize modifications to the space
1370
+ | shown in Figure 2.11, where we show a user clicking and dragging a cursor across a
1371
+ | plane to “add” a window. This example illustrates the range of possible uses of our
1372
+ | system.
1373
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 31
1374
+ blank |
1375
+ |
1376
+ |
1377
+ |
1378
+ text | [Figure 2.10 layout: one row per data set (house 1 through house 6), with columns
+ | (a), (b), (c).]
1405
+ blank |
1406
+ text | Figure 2.10: (a) Manually constructed floor plans generated from point-to-point laser
1407
+ | measurements, (b) floor plans acquired with our system, and (c) overlay. For house
1408
+ | 4, some parts (pillars in a large open space, stairs, and an elevator) are ignored by the
1409
+ | user. The system still uses the measurements from those parts and other objects to
1410
+ | correctly understand the relative positions of the rooms.
1411
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 32
1412
+ blank |
1413
+ |
1414
+ |
1415
+ |
1416
+ text | Figure 2.11: The system, having detected the planes in the scene, also allows the user
1417
+ | to interact directly with the physical world. Here the user adds a window to the room
1418
+ | by dragging a cursor across the wall (left). This motion updates the internal model
1419
+ | of the world (right).
1420
+ meta | Chapter 3
1421
+ blank |
1422
+ title | Acquiring 3D Indoor Environments
1423
+ | with Variability and Repetition2
1424
+ blank |
1425
+ text | Unlike mapping of urban environments, interior mapping focuses on interior
1426
+ | objects, which can be geometrically complex, located in cluttered settings, and undergo
1427
+ | significant variations. In addition, the indoor 3-D data captured from RGB-D cameras
1428
+ | suffer from limited resolution and data quality. The process is further complicated
1429
+ | when the model deforms between successive acquisitions. The work described in this
1430
+ | chapter focused on acquiring and understanding objects in interiors of public buildings
1431
+ | (e.g., schools, hospitals, hotels, restaurants, airports, train stations) or office buildings
1432
+ | from RGB-D camera scans of such interiors.
1433
+ | We exploited three observations to make the problem of indoor 3D acquisition
1434
+ | tractable: (i) most such building interiors are composed of basic elements such as
1435
+ | walls, doors, windows, furniture (e.g., chairs, tables, lamps, computers, cabinets),
1436
+ | which come from a small number of prototypes and repeat many times. (ii) such
1437
+ | building components usually consist of rigid parts of simple geometry, i.e., they have
1438
+ | surfaces that are well approximated by planar, cylindrical, conical, spherical proxies.
1439
+ | Further, although variability and articulation are dominant (e.g., a chair is moved
1440
+ meta | 2
1441
+ text | The contents of the chapter was originally published as Young Min Kim, Niloy J. Mitra,
1442
+ | Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring 3D indoor environments with vari-
1443
+ | ability and repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
1444
+ | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157.
1445
+ blank |
1446
+ |
1447
+ meta | 33
1448
+ | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 34
1449
+ blank |
1450
+ |
1451
+ |
1452
+ |
1453
+ text | [Figure 3.1 panels (office scene): input single-view scan, recognized objects,
+ | retrieved and posed models.]
1459
+ blank |
1460
+ |
1461
+ text | Figure 3.1: (Left) Given a single view scan of a 3D environment obtained using a
1462
+ | fast range scanner, the system performs scene understanding by recognizing repeated
1463
+ | objects, while factoring out their modes of variability (middle). The repeating ob-
1464
+ | jects have been learned beforehand as low-complexity models, along with their joint
1465
+ | deformations. The system extracts the objects despite a poor-quality input scan with
1466
+ | large missing parts and many outliers. The extracted parameters can then be used
1467
+ | to pose 3D models to create a plausible scene reconstruction (right).
1468
+ blank |
1469
+ text | or rotated, a lamp arm is bent and adjusted), such variability is limited and low-
1470
+ | dimensional (e.g., translational motion, hinge joint, telescopic joint). (iii) mutual
1471
+ | relationships among the basic objects satisfy strong priors (e.g., a chair stands on the
1472
+ | floor, a monitor rests on the table).
1473
+ | We present a simple yet practical system to acquire models of indoor objects such
1474
+ | as furniture, together with their variability modes, and discover object repetitions
1475
+ | and exploit them to speed up large-scale indoor acquisition towards high-level scene
1476
+ | understanding. Our algorithm works in two phases. First, in the learning phase, the
1477
+ | system starts from a few scans of individual objects to construct primitive-based 3D
1478
+ | models while explicitly recovering respective joint attributes and modes of variation.
1479
+ | Second, in the fast recognition phase (about 200ms/model), the system starts from a
1480
+ | single-view scan to segment and classify it into plausible objects, recognize them, and
1481
+ | extract the pose parameters for the low-complexity models generated in the learning
1482
+ | phase. Intuitively, our system uses priors for primitive types and their connections,
1483
+ | thus greatly reducing the number of unknowns to enable model fitting even from
1484
+ | very sparse and low-resolution datasets, while hierarchically associating subsets of
1485
+ | scans to parts of objects. We also demonstrate that simple inter- and intra-object
1486
+ | relations simplify segmentation and classification tasks necessary for high-level scene
1487
+ | understanding (see [MPWC12] and references therein).
1488
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 35
1489
+ blank |
1490
+ |
1491
+ |
1492
+ text | We tested our method on a range of challenging synthetic and real-world scenes.
1493
+ | We present, for the first time, basic scene reconstruction for massive indoor scenes
1494
+ | (e.g., office spaces, building auditoriums on a university campus) from unreliable
1495
+ | sparse data by exploiting the low-complexity variability of common scene objects. We
1496
+ | show how we can now detect meaningful changes in an environment. For example,
1497
+ | our system was able to discover a new object placed in an office space by rescanning the
1498
+ | scene, despite articulations and motions of the previously extant objects (e.g., desk,
1499
+ | chairs, monitors, lamps). Thus, the system factors out nuisance modes of variability
1500
+ | (e.g., motions of the chairs, etc.) from variability modes that have importance in an
1501
+ | application (e.g., security, where the new scene objects should be flagged).
1502
+ blank |
1503
+ |
1504
+ title | 3.1 Related Work
1505
+ blank |
1506
+ title | 3.1.1 Scanning Technology
1507
+ text | Rusinkiewicz et al. [RHHL02] demonstrated the possibility of real-time lightweight 3D
1508
+ | scanning. More generally, surface reconstruction from unorganized pointcloud data
1509
+ | has been extensively studied in computer graphics, computational geometry, and
1510
+ | computer vision (see [Dey07]). Further, powered by recent developments in real-time
1511
+ | range scanning, everyday users can now easily acquire 3D data at high frame-rates.
1512
+ | Researchers have proposed algorithms to accumulate multiple poor-quality individual
1513
+ | frames to obtain better quality pointclouds [MFO+ 07, HKH+ 12, IKH+ 11]. Our main
1514
+ | goal differed, however, because our system focused on recognizing important elements
1515
+ | and semantically understanding large 3D indoor environments.
1516
+ blank |
1517
+ |
1518
+ title | 3.1.2 Geometric Priors for Objects
1519
+ text | Our system utilizes geometry on the level of individual objects, which are possible
1520
+ | abstractions used by humans to understand the environment [MZL+ 09]. Similar to Xu
1521
+ | et al. [XLZ+ 10], we understand an object as a collection of primitive parts and segment
1522
+ | the object based on the prior. Such a prior can successfully fill regions of missing
1523
+ | parts [PMG+ 05], infer plausible part motions of mechanical assemblies [MYY+ 10],
1524
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 36
1525
+ blank |
1526
+ |
1527
+ |
1528
+ text | extract shape by deforming a template model to match silhouette images [XZZ+ 11],
1529
+ | locate an object from photographs [XS12], or semantically edit images based on simple
1530
+ | scene proxies [ZCC+ 12].
1531
+ | The system focuses on locating 3D deformable objects in unsegmented, noisy,
1532
+ | single-view data in a cluttered environment. Researchers have used non-rigid align-
1533
+ | ment to better align (warped) multiple scans [LAGP09]. Alternately, temporal infor-
1534
+ | mation across multiple frames can be used to track and recover a deformable model
1535
+ | with joints between rigid parts [CZ11]. Instead, our system learns an instance-specific
1536
+ | geometric prior as a collection of simple primitives along with deformation modes from
1537
+ | a very small number of scans. Note that the priors are extracted in the learning stage,
1538
+ | rather than being hard coded in the framework. We demonstrate that such models
1539
+ | are sufficiently representative to extract the essence of real-world indoor scenes (see
1540
+ | also concurrent efforts by Nan et al. [NXS12] and Shao et al. [SXZ+ 12]).
1541
+ blank |
1542
+ |
1543
+ title | 3.1.3 Scene Understanding
1544
+ text | In the context of image understanding, Lee et al. [LGHK10] constructed a box-
1545
+ | based reconstruction of indoor scenes using volumetric considerations, while Gupta
1546
+ | et al. [GEH10] applied geometric constraints and physical considerations to obtain a
1547
+ | block-based 3D scene model. In the context of range scans, there have been only a few
1548
+ | efforts: Triebel et al. [TSS10] presented an unsupervised algorithm to detect repeating
1549
+ | parts by clustering on pre-segmented input data, while Koppula et al. [KAJS11] used
1550
+ | a graphical model to learn features and contextual relations across objects. Earlier,
1551
+ | Schnabel et al. [SWWK08] detected features in large point clouds using constrained
1552
+ | graphs that describe configurations of basic shapes (e.g., planes, cylinders, etc.) and
1553
+ | then performed a graph matching, which cannot be directly used in large, cluttered
1554
+ | environments captured at low resolutions.
1555
+ | Various learning-based approaches have recently been proposed to analyze and
1556
+ | segment 3D geometry, especially towards consistent segmentation and part-label asso-
1557
+ | ciation [HKG11, SvKK+ 11]. While similar MRF or CRF optimization can be applied
1558
+ | in our settings, we found that a fully geometric algorithm can produce comparable
1559
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 37
1560
+ blank |
1561
+ |
1562
+ |
1563
+ text | high-quality recognition results without extensive training. In our setting, learning
1564
+ | amounts to recovering the appropriate deformation model for the scanned model
1565
+ | in terms of arrangement of primitives and their connection types. While most of
1566
+ | machine-learning approaches are restricted to local features and limited viewpoints,
1567
+ | our geometric approach successfully handles the variability of objects and utilizes
1568
+ | extracted high-level information.
1569
+ blank |
1570
+ text | [Figure 3.2 diagram: Learning — scans I11, I12, I13, ... yield model M1 and scans
+ | I21, I22, I23, ... yield model M2; Recognition — scene scan S yields objects
+ | o1, o2, ...]
1582
+ blank |
1583
+ text | Figure 3.2: Our algorithm consists of two main phases: (i) a relatively slow learn-
1584
+ | ing phase to acquire object models as collections of interconnected primitives and their
1585
+ | joint properties and (ii) a fast object recognition phase that takes an average of
1586
+ | 200 ms/model.
1587
+ blank |
1588
+ |
1589
+ |
1590
+ |
1591
+ title | 3.2 Overview
1592
+ text | Our framework works in two main phases: a learning phase and a recognition phase
1593
+ | (see Figure 3.2).
1594
+ | In the learning phase, our system scans each object of interest a few times (typi-
1595
+ | cally 5-10 scans across different poses). The goal is to consistently segment the scans
1596
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 38
1597
+ blank |
1598
+ |
1599
+ |
1600
+ text | into parts as well as identify the junction between part-pairs to recover the respective
1601
+ | junction attributes. Such a goal, however, is challenging given the input quality. We
1602
+ | address the problem using two scene characteristics: (i) many man-made objects are
1603
+ | well approximated by a collection of simple primitives (e.g., planes, boxes, cylinders)
1604
+ | and (ii) the types of junctions between such primitives are limited (e.g., hinge, trans-
1605
+ | lational) and of low-complexity. First, our system recovers a set of stable primitives
1606
+ | for each individual scan. Then, for each object, the system collectively processes
1607
+ | the scans to extract a primitive-based proxy representation along with the necessary
1608
+ | inter-part junction attributes to build a collection of models {M1 , M2 , . . . }.
1609
+ | In the recognition phase, the system starts with a single scan S of the scene.
1610
+ | First, the system extracts the dominant planes in the scene – typically they capture
1611
+ | the ground, walls, desks, etc. The system identifies the ground plane by using the
1612
+ | (approximate) up-vector from the acquisition device and noting that the points lie
1613
+ | above the ground. Planes parallel to the ground are tagged as tabletops if they are at
1614
+ | heights as observed in the training phase (typically 1′ -3′ ) by exploiting the fact that
1615
+ | working surfaces have similar heights across rooms. The system removes the points
1616
+ | associated with the ground plane and the candidate tabletop, and performs connected
1617
+ | component analysis on the remaining points (on a kn -nearest neighbor graph) to
1618
+ | extract pointsets {o1 , o2 , . . . }.
1619
+ | The system tests if each pointset oi can be satisfactorily explained by any of the
1620
+ | object models Mj . Note, however, that this step is difficult since the data is unreliable
1621
+ | and the objects can have large geometric variations due to changes in the position
1622
+ | and pose of objects. The system performs hierarchical matching which uses the
1623
+ | learned geometry, while trying to match individual parts first, and exploits simple
1624
+ | scene priors like (i) placement relations (e.g., monitors are placed on desks, chairs
1625
+ | rest on the ground) and (ii) allowable repetition modes (e.g., monitors usually repeat
1626
+ | horizontally, chairs are repeated on the ground). We assume such priors are available
1627
+ | as domain knowledge (e.g., Fisher et al. [FSH11]).
1628
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 39
1629
+ blank |
1630
+ |
1631
+ |
1632
+ |
1633
+ text | [Figure 3.3 panels: points I; super-points X = {x1, x2, ...}; parts P = {p1, p2, ...};
+ | objects O = {o1, o2, ...}.]
1635
+ blank |
1636
+ |
1637
+ text | Figure 3.3: The unstructured input point cloud is processed into a hierarchical data struc-
1638
+ | ture composed of super-points, parts, and objects.
1639
+ blank |
1640
+ title | 3.2.1 Models
1641
+ text | Our system represents the objects of interest as models that approximate the object
1642
+ | shapes while encoding deformation and relationship information (see also [OLGM11]).
1643
+ | Each model can be thought of as a graph structure, the nodes of which denote the
1644
+ | primitives and the edges of which encode the nodes’ connectivity and relationship
1645
+ | to the environment. Currently, the primitive types are limited to box, cylinder, and
1646
+ | radial structure. A box is used to represent a large flat structure; a cylinder is used to
1647
+ | represent a long and narrow structure; and a radial structure is used to capture parts
1648
+ | with discrete rotational symmetry (e.g., the base of a swivel chair). As an additional
1649
+ | regularization, the system groups parallel cylinders of similar lengths (e.g., legs of
1650
+ | a desk or arms of a chair), which in turn provides valuable cues for possible mirror
1651
+ | symmetries.
1652
+ | The connectivity between a pair of primitives is represented as their transfor-
1653
+ | mation relative to each other and their possible deformations. Our current imple-
1654
+ | mentation restricts deformations to be 1-DOF translation, 1-DOF rotation, and an
1655
+ | attachment. The system tests for translational joints for the cylinders and rotational
1656
+ | joints for cylinders or boxes (e.g., a hinge joint). An attachment represents the ex-
1657
+ | istence of a whole primitive node and is especially useful when, depending on the
1658
+ | configuration, the segmentation of the primitive is ambiguous. For example, the ge-
1659
+ | ometry of doors or drawers of cabinets is not easily segmented when they are closed,
1660
+ | and thus they are handled as an attachment when opened.
1661
+ | Additionally, the system detects contact information for the model, i.e., whether
1662
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 40
1663
+ blank |
1664
+ |
1665
+ |
1666
+ text | the object rests on the ground or on a desk. Note that the system assumes that the
1667
+ | vertical direction is known for the scene. Both the direction of the model and the
1668
+ | direction of the ground define a canonical object transformation.
1669
+ blank |
1670
+ |
1671
+ title | 3.2.2 Hierarchical Structure
1672
+ text | For both the learning and recognition phases, the raw input is unstructured point
1673
+ | clouds. The input is hierarchically organized by considering neighboring points and
1674
+ | assigning contextual information to each hierarchy level. The scene hierarchy has three
1675
+ | levels of segmentation (see Figure 3.3):
1676
+ blank |
1677
+ text | • super-points X = {x1 , x2 , ...};
1678
+ | • parts P = {p1 , p2 , ...} (association Xp = {x : P (x) = p}); and
1679
+ | • objects O = {o1 , o2 , ...} (association Po = {p : O(p) = o}).
1680
+ blank |
1681
+ text | Instead of working directly on individual points, our system uses super-points
1682
+ | x ∈ X as the atomic entities (analogous to super-pixels in images). The system
1683
+ | creates super-points by uniformly sampling points from the raw measurements and
1684
+ | associating local neighborhoods with the samples based on the normal consistency
1685
+ | of points. Such super-points, or a group of points within a small neighborhood, are
1686
+ | less noisy, while at the same time they are sufficiently small to capture the input
1687
+ | distribution of points.
1688
+ | Next, our system aggregates neighboring super-points into primitive parts p ∈ P .
1689
+ | Such parts are expected to relate to individual primitives of models. Each part p
1690
+ | comprises a set of super-points Xp . The system initially finds such parts by merging
1691
+ | neighboring super-points until the region can no longer be approximated by a plane
1692
+ | (in a least squares sense) with average error less than a threshold θdist . Note that the
1693
+ | initial association of super-points with parts can change later.
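+ text | A minimal sketch of this merging rule is given below; the adjacency structure, the use
+ | of an RMS plane-fit residual as the "average error", and the greedy growth order are
+ | assumptions of the sketch rather than details taken from the thesis.
+ blank |
+ | import numpy as np
+ |
+ | def plane_rms_error(points):
+ |     """RMS distance of an (N, 3) point set to its least-squares plane."""
+ |     s = np.linalg.svd(points - points.mean(axis=0), compute_uv=False)
+ |     return s[-1] / np.sqrt(len(points))
+ |
+ | def grow_parts(super_points, neighbors, theta_dist=0.1):
+ |     """super_points: list of (k_i, 3) point arrays; neighbors: adjacency list over
+ |     super-points.  Merge neighbors while the merged region stays planar within theta_dist."""
+ |     part_of = [-1] * len(super_points)
+ |     parts = []
+ |     for seed in range(len(super_points)):
+ |         if part_of[seed] >= 0:
+ |             continue
+ |         part_id, members, queue = len(parts), [seed], [seed]
+ |         part_of[seed] = part_id
+ |         while queue:
+ |             cur = queue.pop()
+ |             for nb in neighbors[cur]:
+ |                 if part_of[nb] >= 0:
+ |                     continue
+ |                 merged = np.vstack([super_points[i] for i in members + [nb]])
+ |                 if plane_rms_error(merged) < theta_dist:
+ |                     part_of[nb] = part_id
+ |                     members.append(nb)
+ |                     queue.append(nb)
+ |         parts.append(members)
+ |     return parts, part_of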
1694
+ | Objects form the final hierarchy level during the recognition phase for scenes con-
1695
+ | taining multiple objects. Objects, having been segmented, are mapped to individ-
1696
+ | ual instances of models, while the association between objects and parts (O(p) ∈
1697
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 41
1698
+ blank |
1699
+ |
1700
+ |
1701
+ text | {1, 2, · · · , No } and Po ) are discovered during the recognition process. Note that dur-
1702
+ | ing the learning phase the system deals with only one object at a time and hence
1703
+ | such segmentation is trivial.
1704
+ | The system creates such a hierarchy in the pre-processing stage using the following
1705
+ | parameters in all our tests: the number of nearest neighbors kn used for normal estimation,
1706
+ | sampling rate fs for super-points, and distance threshold θdist , which reflects the
1707
+ | approximate noise level. Table 3.1 shows the actual values.
1708
+ blank |
1709
+ text | param. values usage
1710
+ | kn 50 number of nearest neighbor
1711
+ | fs 1/100 sampling rate
1712
+ | θdist 0.1m distance threshold for segmentation
1713
+ | Ñp 10-20 Equation 3.1
1714
+ | θheight 0.5 Equation 3.5
1715
+ | θnormal 20◦ Equation 3.6
1716
+ | θsize 2θdist Equation 3.7
1717
+ | λ 0.8 coverage ratio to declare a match
1718
+ blank |
1719
+ text | Table 3.1: Parameters used in our algorithm.
1720
+ blank |
1721
+ |
1722
+ |
1723
+ |
1724
+ title | 3.3 Learning Phase
1725
+ text | The input to the learning phase is a set of point clouds {I 1 , . . . , I n } obtained from
1726
+ | the same object in different configurations. Our goal is to build a model M consisting
1727
+ | of primitives that are linked by joints. Essentially, the system has to simultaneously
1728
+ | segment the scans into an unknown number of parts, establish correspondence across
1729
+ | different measurements, and extract relative deformations. We simplify the problem
1730
+ | by assuming that each part can be represented by primitives and that each joint
1731
+ | can be encoded with a simple degree of freedom (see also [CZ11]). This assumption
1732
+ | allows us to approximate many man-made objects, while at the same time it leads to
1733
+ | a lightweight model. Note that, unlike Schnabel et al. [SWWK08], who use patches
1734
+ | of partial primitives, our system uses full primitives to represent parts in the learning
1735
+ | phase.
1736
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 42
1737
+ blank |
1738
+ |
1739
+ |
1740
+ text | [Figure 3.4 diagram: "Initialize the skeleton (Sec. 3.3.1)": mark stable parts/part-groups, match marked parts, jointly fit primitives to matched parts, update parts; "Incrementally complete the coherent model (Sec. 3.3.2)": match parts by relative position, jointly fit primitives to matched parts, update parts]
1752
+ blank |
1753
+ |
1754
+ |
1755
+ |
1756
+ text | Figure 3.4: The learning phase starts by initializing the skeleton model, which is
1757
+ | defined from coherent matches of stable parts. After initialization, new primitives are
1758
+ | added by finding groups of parts at similar relative locations, and then the primitives
1759
+ | are jointly fitted.
1760
+ blank |
1761
+ text | The learning phase starts by detecting large and stable parts to establish a global
1762
+ | reference frame across different measurements I i (Section 3.3.1). The initial corre-
1763
+ | spondences serve as a skeleton of the model, while other parts are incrementally added
1764
+ | to the model until all of the points are covered within threshold θdist (Section 3.3.2).
1765
+ | Because primitive fitting is unstable on isolated noisy scans, our system jointly refines
1766
+ | the primitives to construct a coherent model M (see Figure 3.4).
1767
+ | The final model also contains attributes necessary for robust matching. For ex-
1768
+ | ample, the distribution of height from the ground plane provides a prior for tables;
1769
+ | objects can have a preferred repetition direction, e.g., monitors or auditorium chairs
1770
+ | are typically repeated sidewise; or objects can have preferred orientations. These
1771
+ | learned attributes and relationships act as reliable regularizers in the recognition
1772
+ | phase, when data is typically sparse, incomplete, and noisy.
1773
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 43
1774
+ blank |
1775
+ |
1776
+ |
1777
+ title | 3.3.1 Initializing the Skeleton of the Model
1778
+ text | The initial structure is derived from large, stable parts across different measurements,
1779
+ | whose consistent correspondences define the reference frame that aligns the measure-
1780
+ | ments. In the pre-processing stage, individual scans I i are divided into super-points
1781
+ | X i and parts P i , as described in Section 3.2.2. The system then marks the stable
1782
+ | parts as candidate boxes or candidate cylinders.
1783
+ | A candidate face of a box is marked by finding parts with a sufficient number of
1784
+ | super-points:
1785
+ | |Xp | > |P|/Ñp , (3.1)
1786
+ blank |
1787
+ text | where Ñp is a user-defined parameter of the approximate number of primitives in the
1788
+ | model. In our tests, a threshold of 10-20 is used. Parallel planes with comparable
1789
+ | heights are grouped together based on their orientation to constitute the opposite
1790
+ | faces of a box primitive.
1791
+ | The system classifies a part as a candidate cylinder if the ratio of the top two
1792
+ | principal components is greater than 2. Subsequently, parallel cylinders with similar
1793
+ | heights (e.g., legs of chairs) are grouped.
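+ text | The two marking rules above can be sketched as follows (illustrative only: the function name,
+ | the input layout, and the use of the total super-point count as a stand-in for the size test of
+ | Equation 3.1 are assumptions).
+ |
+ | import numpy as np
+ |
+ | def mark_part(part_points, n_superpoints_in_part, n_superpoints_total,
+ |               n_primitives_est=15, elongation_ratio=2.0):
+ |     """Mark a part as a candidate box face, a candidate cylinder, or leave it unmarked.
+ |
+ |     part_points:           (k, 3) array of the part's 3D points
+ |     n_superpoints_in_part: |Xp|, the number of super-points in the part
+ |     n_superpoints_total:   total number of super-points in the measurement
+ |     """
+ |     # Size test in the spirit of Equation 3.1 (Ñp = n_primitives_est).
+ |     is_large = n_superpoints_in_part > n_superpoints_total / n_primitives_est
+ |
+ |     # Principal components of the part from the covariance eigenvalues.
+ |     centered = part_points - part_points.mean(axis=0)
+ |     evals = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]
+ |
+ |     if evals[0] / max(evals[1], 1e-12) > elongation_ratio:
+ |         return "candidate_cylinder"   # strongly elongated part
+ |     if is_large:
+ |         return "candidate_box_face"   # large, sufficiently populated part
+ |     return "unmarked"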
1794
+ | After candidate boxes and cylinders are marked, the system matches the marked
1795
+ | (sometimes grouped) parts for pairs of measurements P i . The system only uses the
1796
+ | consistent matches to define a reference frame between measurements and jointly fit
1797
+ | primitives to the matched parts (see Section 3.3.2).
1798
+ blank |
1799
+ title | Matching
1800
+ blank |
1801
+ text | After extracting the stable parts P i for each measurement, our goal is to match the
1802
+ | parts across different measurements to build a connectivity structure. The system
1803
+ | picks a seed measurement j ∈ {1, 2, ..., n} at random and compares every other mea-
1804
+ | surement against the seed measurement.
1805
+ | Our system then uses spectral correspondences [LH05] to match parts in seed
1806
+ | {p, q} ∈ P j and other {p′ , q ′ } ∈ P i . The system builds an affinity matrix A, where
1807
+ | each entry represents the matching score between part pairs. Recall that candidate
1808
+ | parts p have associated types (box or cylinder), say t(p). Intuitively, the system
1809
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 44
1810
+ blank |
1811
+ |
1812
+ |
1813
+ text | assigns a higher matching score for the parts with the same type t(p) at similar
1814
+ | relative positions. If a candidate assignment a = (p, p′ ) assigns p ∈ P j to p′ ∈ P i , the
1815
+ | corresponding entries are defined as the following:
1816
+ | A(a, a) = 0 if t(p) ≠ t(p′ ), and A(a, a) = exp(−(hp − hp′ )² / (2 θdist²)) otherwise.   (3.2)
1822
+ blank |
1823
+ text | where our system uses the height from the ground hp as a feature. The affinity value
1824
+ | for a pair-wise assignment between a = (p, p′ ) and b = (q, q ′ ) (p, q ∈ P j and p′ , q ′ ∈ P i )
1825
+ | is defined as:
1826
+ | A(a, b) = 0 if t(p) ≠ t(p′ ) or t(q) ≠ t(q ′ ), and
+ | A(a, b) = exp(−(d(p, q) − d(p′ , q ′ ))² / (2 θdist²)) otherwise,   (3.3)
1834
+ blank |
1835
+ |
1836
+ |
1837
+ text | where d(p, q) represents the distance between two parts p, q ∈ P . The system ex-
1838
+ | tracts the most dominant eigenvector of A to establish a correspondence among the
1839
+ | candidate parts.
1840
+ | After comparing the seed measurement P j against all the other measurements P i ,
1841
+ | the system retains only those matches that are consistent across different measure-
1842
+ | ments. The relative positions of the matched parts define the reference frame of the
1843
+ | object as well as the relative transformation between measurements.
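+ text | The spectral step can be sketched roughly as follows (a simplified reading of the matching
+ | in [LH05], not the thesis code): fill the affinity matrix with Equations 3.2 and 3.3, take its
+ | dominant eigenvector, and greedily keep mutually exclusive assignments. The containers for
+ | heights, types, and pairwise part distances are assumed to be precomputed for both orderings.
+ |
+ | import numpy as np
+ |
+ | def spectral_match(cands, heights_j, heights_i, types_j, types_i,
+ |                    dist_j, dist_i, theta_dist=0.1):
+ |     """Greedy spectral matching over candidate assignments.
+ |
+ |     cands:     list of assignments a = (p, p') with p in P^j and p' in P^i
+ |     heights_*: dict part -> height above the ground plane
+ |     types_*:   dict part -> 'box' or 'cylinder'
+ |     dist_*:    dict (p, q) -> distance between two parts (both orderings stored)
+ |     """
+ |     n = len(cands)
+ |     A = np.zeros((n, n))
+ |     for a, (p, pp) in enumerate(cands):
+ |         if types_j[p] == types_i[pp]:                         # Eq. (3.2)
+ |             A[a, a] = np.exp(-(heights_j[p] - heights_i[pp]) ** 2
+ |                              / (2 * theta_dist ** 2))
+ |         for b, (q, qq) in enumerate(cands):
+ |             if b == a or p == q or pp == qq:
+ |                 continue
+ |             if types_j[p] == types_i[pp] and types_j[q] == types_i[qq]:
+ |                 d = dist_j[(p, q)] - dist_i[(pp, qq)]         # Eq. (3.3)
+ |                 A[a, b] = np.exp(-d ** 2 / (2 * theta_dist ** 2))
+ |
+ |     # Dominant eigenvector of the symmetric affinity matrix.
+ |     _, evecs = np.linalg.eigh(A)
+ |     x = np.abs(evecs[:, -1])
+ |
+ |     # Greedily accept assignments by descending score, at most one per part.
+ |     matches, used_j, used_i = [], set(), set()
+ |     for a in np.argsort(-x):
+ |         p, pp = cands[a]
+ |         if x[a] > 0 and p not in used_j and pp not in used_i:
+ |             matches.append((p, pp))
+ |             used_j.add(p)
+ |             used_i.add(pp)
+ |     return matches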
1844
+ blank |
1845
+ title | Joint Primitive Fitting
1846
+ blank |
1847
+ text | Our system jointly fits primitives to the grouped parts, while adding necessary defor-
1848
+ | mation. First, the primitive type is fixed by testing for the three types of primitives
1849
+ | (box, cylinder, and rotational structure) and picking the primitive with the smallest
1850
+ | fitting error. Once the primitive type is fixed, the corresponding primitives from other
1851
+ | measurements are averaged and added to the model as a jointly fitted primitive.
1852
+ | Our system uses the coordinate frame to position the fitted primitives. More
1853
+ | specifically, the three orthogonal directions of a box are taken from the frame of
+ | reference defined by the ground direction and the relative positions of the matched
1855
+ | parts. If the normal of the largest observed face does not align with the default frame
1856
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 45
1857
+ blank |
1858
+ |
1859
+ |
1860
+ text | of reference, the box is rotated around an axis to align the large plane. The cylinder
1861
+ | is aligned using its axis, while the rotational primitive is tested when the part is at
1862
+ | the bottom of an object.
1863
+ | Note that unlike a cylinder or a rotational structure, a box can introduce new
1864
+ | faces that are invisible because of the placement rules of objects. For example, the
1865
+ | bottom of a chair seat or the back of a monitor are often missing in the input scans.
1866
+ | Hence, the system retains the information about which of the six faces are visible to
1867
+ | simplify the subsequent recognition phase.
1868
+ | Our system now encodes the inter-primitive connectivity as an edge of the graph
1869
+ | structure. The joints between primitives are added by comparing the relationship
1870
+ | between the parent and child primitives. The first matched primitive acts as a root
1871
+ | to the model graph. Subsequent primitives are the children of the closest primitive
1872
+ | among those already existing in the model. A translational joint is added if the size
1873
+ | of the primitive node varies over different measurements by more than θdist ; or, a
1874
+ | rotational joint is added when the relative angle between the parent and child node
1875
+ | differs by more than 20◦ .
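+ text | A minimal sketch of this joint rule, assuming the per-measurement sizes and parent-child
+ | angles come from the jointly fitted primitives; the function and argument names are illustrative.
+ |
+ | import numpy as np
+ |
+ | def infer_joint(sizes, rel_angles_deg, theta_dist=0.1, theta_angle_deg=20.0):
+ |     """Decide which joint connects a child primitive to its parent.
+ |
+ |     sizes:          per-measurement sizes of the child primitive (metres)
+ |     rel_angles_deg: per-measurement angle between child and parent (degrees)
+ |     Returns 'translational', 'rotational', or 'rigid'.
+ |     """
+ |     sizes = np.asarray(sizes, dtype=float)
+ |     angles = np.asarray(rel_angles_deg, dtype=float)
+ |
+ |     if sizes.max() - sizes.min() > theta_dist:         # size varies across scans
+ |         return "translational"
+ |     if angles.max() - angles.min() > theta_angle_deg:  # orientation varies
+ |         return "rotational"
+ |     return "rigid"
+ |
+ | # Example: a drawer whose extent changes by 25 cm across scans.
+ | print(infer_joint([0.45, 0.62, 0.70], [0.0, 1.5, 0.8]))  # -> 'translational'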
1876
+ blank |
1877
+ |
1878
+ title | 3.3.2 Incrementally Completing a Coherent Model
1879
+ text | Having built an initial model structure, the system incrementally adds primitives by
1880
+ | processing super-points that could not be explained by the primitives. The remaining
1881
+ | super-points are processed to create parts, and the parts are matched based on their
1882
+ | relative positions. Starting from the bottom-most matches, the system jointly fits
1883
+ | primitives to the matched parts, as described above. The system iterates the process
1884
+ | until all super-points in measurements are explained by the model.
1885
+ | If some parts exist only in a subset of the measurements, then the
+ | system adds the corresponding primitive as an attachment. For example, in Figure 3.5, after each
1887
+ | side of the rectangular shape of a drawer has been matched, the open drawer is added
1888
+ | as an attachment to the base shape.
1889
+ | The system also maintains the contact point of a model to the ground (or the
1890
+ | bottom-most primitive), the height distribution of each part as a histogram, visible face
1891
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 46
1892
+ blank |
1893
+ |
1894
+ |
1895
+ |
1896
+ text | [Figure 3.5 image: labels "open drawers" and "unmatched parts"]
1902
+ blank |
1903
+ text | Figure 3.5: The open drawers remain as unmatched (grey) after incremental matching
1904
+ | and joint primitive fitting. These parts will be added as an attachment of the model.
1905
+ blank |
1906
+ text | information, and the canonical frame of reference defined during the matching process.
1907
+ | This information, along with the extracted models, is used during the recognition
1908
+ | phase.
1909
+ blank |
1910
+ |
1911
+ title | 3.4 Recognition Phase
1912
+ text | Having learned a set of models (along with their deformation modes) M := {M1 , . . . , Mk }
1913
+ | for a particular environment, the system can quickly collect and understand the envi-
1914
+ | ronment in the recognition phase. This phase is much faster than the learning phase
1915
+ | since there are only a small number of simple primitives and certain deformation
1916
+ | modes from which to search. As an input, the scene S containing the learned models
1917
+ | is collected using the framework from Engelhard et al. [EEH+ 11] which takes a few
1918
+ | seconds. In a pre-processing stage, the system marks the most dominant plane as the
1919
+ | ground plane g. Then, the second most dominant plane that is parallel to the ground
1920
+ | plane is marked as the desk plane d. The system processes the remaining points to
1921
+ | form a hierarchical structure with super-points, parts, and objects (see Section 3.2.2).
1922
+ | The recognition phase starts from a part-based assignment, which quickly com-
1923
+ | pares parts in the measurement and primitive nodes in each model. The algorithm
1924
+ | infers the deformation and transformation of the model from the matched parts, while
+ | validating the match by comparing the actual measurement against the underlying
1926
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 47
1927
+ blank |
1928
+ |
1929
+ |
1930
+ text | [Figure 3.6 diagram: top, "Initial assignments for parts (Sec. 3.4.1)": parts {p1, p2, ...} of an object in the scene are matched to model nodes {m1, m2, m3, l1, α3} of M (e.g., p1 = m3); bottom, "Refined assignment with geometry (Sec. 3.4.2)": iterate between solving for deformation given matches and finding correspondence and segmentation]
1952
+ blank |
1953
+ text | Figure 3.6: Overview of the recognition phase. The algorithm first finds matched parts
1954
+ | before proceeding to recover the entire model and its corresponding segmentation.
1955
+ blank |
1956
+ text | geometry. If a sufficient portion of measurements can be explained by the model,
1957
+ | the system accepts the match as valid, and the segmentation at both the object and
+ | part level is refined to match the model.
1959
+ blank |
1960
+ |
1961
+ title | 3.4.1 Initial Assignment for Parts
1962
+ text | Our system first makes coarse assignments between segmented parts and model nodes
1963
+ | to quickly reduce the search space (see Figure 3.6, top). If a part and a primitive node
1964
+ | form a potential match, the system also induces the relative transformation between
1965
+ | them. The output of the algorithm is a list of triplets {(p, m, T )}, each composed of a
+ | part, a node from the model, and a transformation.
1967
+ | Our system uses geometric features to decide whether individual parts can be
1968
+ | matched with model nodes. Note that the system does not use color information in
1969
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 48
1970
+ blank |
1971
+ |
1972
+ |
1973
+ text | our setting. As features for individual parts Ap , our system considers the following:
1974
+ | (i) height distribution from ground plane as a histogram vector hp ; (ii) three principal
1975
+ | components of the region x1p , x2p , x3p (with x3p = np ); and (iii) sizes along the directions
1976
+ | lp1 > lp2 > lp3 .
1977
+ | Similarly, the system infers the counterpart of features for individual visible faces
1978
+ | of model parts Am . Thus, even if only one face of a part is visible in the measurement,
1979
+ | our system is still able to detect the matched part of the model. The height histogram
1980
+ | hm is calculated from the relative area per height interval and the dimensions and
1981
+ | principal components are inferred from the shape of the faces.
1982
+ | All the parts are compared against all the faces of primitive nodes in the model:
1983
+ blank |
1984
+ text | E(Ap , Am ) = ψ height (hp , hm ) · ψ normal (np , nm ; g) · ψ size ({lp1 , lp2 }, {lm1 , lm2 }).   (3.4)
1988
+ blank |
1989
+ text | Each individual potential function ψ returns either 1 (matched) or 0 (not matched), de-
+ | pending on whether a feature satisfies its criterion within an allowable threshold. Parts are
+ | matched only if all the feature criteria are satisfied. The height potential calculates
1992
+ | the histogram intersection
1993
+ | ψ height (hp , hm ) = Σi min(hp (i), hm (i)) > θheight .   (3.5)
1996
+ blank |
1997
+ |
1998
+ text | The normal potential calculates the relative angle with the ground plane normal (ng )
1999
+ | as
2000
+ | ψ normal (np , nm ; g) = |acos(np · ng ) − acos(nm · ng )| < θnormal . (3.6)
2001
+ blank |
2002
+ text | The size potential compares the size of the part
2003
+ blank |
2004
+ text | ψ size ({lp1 , lp2 }, {lm1 , lm2 }) = |lp1 − lm1 | < θsize and |lp2 − lm2 | < θsize .   (3.7)
2011
+ blank |
2012
+ text | Our system sets the thresholds generously to allow false positives and retain multiple
+ | (or no) matched parts per object (see Table 3.1). In effect, the system first guesses
2014
+ | potential object-model associations and later prunes out the incorrect associations
2015
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 49
2016
+ blank |
2017
+ |
2018
+ |
2019
+ text | in the refinement step using the full geometry (see Section 3.4.2). If Equation 3.4
2020
+ | returns 1, then the system can obtain a good estimate of the relative transformation
2021
+ | T between the model and the part by using the position, normal, and the ground
2022
+ | plane direction to create a triplet (p, m, T ).
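+ text | To make Equations 3.4-3.7 concrete, here is a minimal sketch of the binary potentials (not
+ | the thesis code). The dictionary layout for part and face features is an assumption, height
+ | histograms are assumed to be normalized and to share the same bins, and the thresholds follow
+ | Table 3.1 with θsize = 2θdist = 0.2 m.
+ |
+ | import numpy as np
+ |
+ | def psi_height(h_p, h_m, theta_height=0.5):
+ |     """Eq. (3.5): histogram intersection of the height distributions."""
+ |     return np.minimum(h_p, h_m).sum() > theta_height
+ |
+ | def psi_normal(n_p, n_m, n_g, theta_normal_deg=20.0):
+ |     """Eq. (3.6): compare the angles to the ground-plane normal n_g."""
+ |     ang_p = np.degrees(np.arccos(np.clip(np.dot(n_p, n_g), -1.0, 1.0)))
+ |     ang_m = np.degrees(np.arccos(np.clip(np.dot(n_m, n_g), -1.0, 1.0)))
+ |     return abs(ang_p - ang_m) < theta_normal_deg
+ |
+ | def psi_size(l_p, l_m, theta_size=0.2):
+ |     """Eq. (3.7): compare the two largest extents of part and face."""
+ |     return abs(l_p[0] - l_m[0]) < theta_size and abs(l_p[1] - l_m[1]) < theta_size
+ |
+ | def part_matches_node(part, face, n_g):
+ |     """Eq. (3.4): a part matches a model face only if all potentials pass.
+ |
+ |     part / face are dicts with keys 'h' (height histogram), 'n' (unit normal),
+ |     and 'l' (two largest extents); this layout is an assumption for the sketch.
+ |     """
+ |     return (psi_height(part["h"], face["h"])
+ |             and psi_normal(part["n"], face["n"], n_g)
+ |             and psi_size(part["l"], face["l"]))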
2023
+ blank |
2024
+ |
2025
+ title | 3.4.2 Refined Assignment with Geometry
2026
+ text | Starting from the list of part, node, and transformation triplets {(p, m, T )}, the sys-
2027
+ | tem verifies the assignments with a full model by comparing a segmented object
2028
+ | o = O(p) against models Mi . The goal is to produce accurate part assignments for
2029
+ | observable parts, transformation, and the deformation parameters. Intuitively, the
2030
+ | system finds a local minimum from the suggested starting point (p, m, T ) with the
2031
+ | help of the models extracted in the learning phase. The system then optimizes by
2032
+ | alternately refining the model pose and updating the segmentation (see Figure 3.6,
2033
+ | bottom).
2034
+ | Given the assignment between p and m, the system first refines the registration and
2035
+ | deformation parameters and places the model M to best explain the measurements.
2036
+ | If the placed model covers most of the points that belong to the object (ratio λ = 0.8
2037
+ | in our tests) within the distance threshold θdist , then the system confirms that the
2038
+ | model is matched to the object. Note that, compared to the generous threshold in
2039
+ | part-matching in Section 3.4.1, the system now sets a conservative threshold to prune
2040
+ | false-positives.
2041
+ | In the case of a match, the geometry is fixed and the system refines the segmen-
2042
+ | tation, i.e., the part and object boundaries are modified to match the underlying
2043
+ | geometry. The process is iterated until convergence.
2044
+ blank |
2045
+ title | Refining Deformation and Registration
2046
+ blank |
2047
+ text | Our system finds the deformation parameters using the relative location and orien-
2048
+ | tation of parts and the contact plane (e.g., desk top, the ground plane). Given any
2049
+ | pair of parts, or a part and the ground plane, their mutual distance and orientation
2050
+ | are formulated as functions of deformation parameters existing between the path of
2051
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 50
2052
+ blank |
2053
+ |
2054
+ |
2055
+ text | [Figure 3.7 image: panels "Input points", "Models matched", "Parts assigned", "Initial objects", "Refined objects"]
2061
+ blank |
2062
+ text | Figure 3.7: The initial object-level segmentation can be imperfect especially between
2063
+ | distant parts. For example, the top and base of a chair initially appeared to be sep-
2064
+ | arate objects, but were eventually understood as the same object after the segments
2065
+ | were refined based on the geometry of the matched model.
2066
+ blank |
2067
+ text | the two parts. For example, if our system starts from matched part-primitive pair p1
2068
+ | and m3 in Figure 3.6, then the height and the normal of the part can be expressed as
2069
+ | functions of the deformation parameters l1 and α3 of the model. The system solves a
2070
+ | set of linear equations defined by the observed parts and the contact location to recover
+ | the deformation parameters. Then, the registration between the scan and the
2072
+ | deformed model is refined by Iterative Closest Point (ICP) [BM92].
2073
+ | Ideally, part p in the scene measurement should be explained by the assigned
2074
+ | part geometry within the distance threshold θdist . The model is matched to the
2075
+ | measurement if the proportion of points within θdist is more than λ. (Note that not
2076
+ | all faces of the part need to be explained by the measurement, as only a subset
2077
+ | of the model is measured by the sensor.) Otherwise, the triplet (p, m, T ) is an invalid
2078
+ | assignment and the algorithm returns false. After initial matching (Section 3.4.1),
2079
+ | multiple parts of an object can match to different primitives of many models. If there
2080
+ | are multiple successful matches for an object, the system retains the assignment with
2081
+ | the largest number of points.
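+ text | The acceptance test itself reduces to a coverage ratio. The sketch below assumes SciPy's
+ | cKDTree is available and that the placed, deformed model has been sampled into points; λ = 0.8
+ | and θdist follow Table 3.1.
+ |
+ | import numpy as np
+ | from scipy.spatial import cKDTree
+ |
+ | def model_explains_object(object_points, model_points,
+ |                           theta_dist=0.1, coverage_lambda=0.8):
+ |     """Accept the match if at least lambda of the object's points lie within
+ |     theta_dist of points sampled from the positioned model.
+ |
+ |     object_points: (n, 3) points of the segmented object
+ |     model_points:  (m, 3) points sampled from the placed, deformed model
+ |     """
+ |     tree = cKDTree(model_points)
+ |     dists, _ = tree.query(object_points, k=1)
+ |     coverage = np.mean(dists < theta_dist)
+ |     return coverage >= coverage_lambda, coverage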
2082
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 51
2083
+ blank |
2084
+ |
2085
+ |
2086
+ title | Refine Segmentation
2087
+ blank |
2088
+ text | After a model is picked and positioned in the configuration, its location is fixed
2089
+ | while the system refines the segmentation based on the underlying model. Recall
2090
+ | that the initial segmentation into parts P merges super-points with similar normals and
+ | objects O group neighboring parts using the distance threshold. Although the initial
2092
+ | segmentations provide a sufficient approximation to roughly locate the models, they
2093
+ | do not necessarily coincide with the actual part and object boundaries without being
2094
+ | compared against the geometry.
2095
+ | First, the system updates the association between super-points and the parts by
2096
+ | finding the closest primitive node of the model for each super-point. The super-points
2097
+ | that belong to the same model node are grouped to the same part (see Figure 3.7).
2098
+ | In contrast, super-points that are farther away than the distance threshold θdist from
2099
+ | any of the primitives are separated to form a new segment with a null assignment.
2100
+ | After the part assignment, the system searches for the missing primitives by merg-
2101
+ | ing neighboring objects (see Figure 3.7). In the initial segmentation, objects which
2102
+ | are close to each other in the scene can lead to multiple objects grouped into a sin-
2103
+ | gle segment. Further, particular viewpoints of an object can cause parts within the
2104
+ | model to appear farther apart, leading to spurious multiple segments. Hence, the
2105
+ | super-points are assigned to an object only after the existence of the object is verified
2106
+ | with the underlying geometry.
2107
+ blank |
2108
+ |
2109
+ title | 3.5 Results
2110
+ text | In this section, we present the performance results obtained from testing our system
2111
+ | on various synthetic and real-world scenes.
2112
+ blank |
2113
+ |
2114
+ title | 3.5.1 Synthetic Scenes
2115
+ text | We tested our framework on synthetic scans of 3D scenes obtained from the Google
2116
+ | 3D Warehouse (see Figure 3.8). We implemented a virtual scanner to generate the
2117
+ | synthetic data: once the user specifies a viewpoint, we read the depth buffer to recover
2118
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 52
2119
+ blank |
2120
+ |
2121
+ |
2122
+ text | 3D range data of the virtual scene from the specified viewpoint. We control the scan
2123
+ | quality using three parameters: (i) scanning density d to control the fraction of points
2124
+ | that are retained, (ii) noise level g to control the zero mean Gaussian noise added to
2125
+ | each point along the current viewing direction, and (iii) the angle noise a to perturb
2126
+ | the position in the local tangent plane using zero-mean Gaussian noise. Unless stated otherwise,
2127
+ | we used default values of d = 0.4, g = 0.01, and a = 5◦ .
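+ text | One possible (assumed) implementation of the three corruption parameters is sketched below:
+ | keep a random fraction d of the points, add Gaussian depth noise of standard deviation g along
+ | the viewing direction, and convert angular jitter of standard deviation a into a tangent-plane
+ | offset scaled by the range. This is only one reading of the description above.
+ |
+ | import numpy as np
+ |
+ | def corrupt_scan(points, view_dir, d=0.4, g=0.01, a_deg=5.0, rng=None):
+ |     """Degrade an ideal synthetic scan, an (n, 3) array of points."""
+ |     rng = np.random.default_rng() if rng is None else rng
+ |     points = np.asarray(points, dtype=float)
+ |     view_dir = np.asarray(view_dir, dtype=float)
+ |     view_dir = view_dir / np.linalg.norm(view_dir)
+ |
+ |     # (i) density: keep a random fraction d of the points
+ |     pts = points[rng.random(len(points)) < d]
+ |
+ |     # (ii) depth noise along the viewing direction
+ |     pts = pts + rng.normal(0.0, g, size=len(pts))[:, None] * view_dir
+ |
+ |     # (iii) angle noise: offset in the plane orthogonal to the view direction,
+ |     # proportional to the distance along the ray
+ |     t1 = np.cross(view_dir, [0.0, 0.0, 1.0])
+ |     if np.linalg.norm(t1) < 1e-8:
+ |         t1 = np.cross(view_dir, [0.0, 1.0, 0.0])
+ |     t1 = t1 / np.linalg.norm(t1)
+ |     t2 = np.cross(view_dir, t1)
+ |     depth = pts @ view_dir
+ |     jitter = np.tan(np.radians(rng.normal(0.0, a_deg, size=(len(pts), 2))))
+ |     pts = pts + depth[:, None] * (jitter[:, :1] * t1 + jitter[:, 1:] * t2)
+ |     return pts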
2128
+ | In Figure 3.8, we present typical recognition results using our framework. The
2129
+ | system learned different models of chairs and placed them with varying deformations
2130
+ | (see Table 3.2). We exaggerated some of the deformation modes, including very
2131
+ | high chairs and severely tilted monitors, but could still reliably detect them all (see
2132
+ | Table 3.3). Beyond recognition, our system reliably recovered both positions and
2133
+ | pose parameters within a 5% error margin of the object size. Incomplete data can,
2134
+ | however, result in ambiguities: for example, in synthetic #2 our system correctly
2135
+ | detected a chair, but displayed it in a flipped position, since the scan contained data
2136
+ blank |
2137
+ |
2138
+ |
2139
+ |
2140
+ text | [Figure 3.8 image: rows labeled "synthetic 1", "synthetic 2", "synthetic 3"]
2151
+ blank |
2152
+ text | Figure 3.8: Recognition results on synthetic scans of virtual scenes: (left to right) syn-
2153
+ | thetic scenes, virtual scans, and detected scene objects with variations. Unmatched
2154
+ | points are shown in gray.
2155
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 53
2156
+ blank |
2157
+ |
2158
+ |
2159
+ text | only from the chair’s back. While specific volume-based reasoning can be used to
2160
+ | give preference to chairs in an upright position, our system avoided such case-specific
2161
+ | rules in the current implementation.
2162
+ blank |
2163
+ |
2164
+ |
2165
+ |
2166
+ text | [Figure 3.9 image: a "similar" pair and a "different" pair of chair models]
2167
+ blank |
2168
+ |
2169
+ text | Figure 3.9: Chair models used in synthetic scenes.
2170
+ blank |
2171
+ text | In practice, acquired data sets suffer from varying sampling resolution, noise, and
2172
+ | occlusion. While it is difficult to exactly mimic real-world scenarios, we ran synthetic
2173
+ | tests to assess the stability of our algorithm. We placed two classes of chairs (see
2174
+ | Figure 3.9) on a ground plane, 70-80 chairs of each type, and created scans from
2175
+ | 5 different viewpoints with varying density and noise parameters. For both classes,
2176
+ | we used our recognition framework to measure precision and recall while varying
2177
+ | parameter λ. Note that precision represents how many of the detected objects are
2178
+ | correctly classified out of the total number of detections, while recall represents how many
2179
+ | objects were correctly detected out of the total number of placed objects. In other
2180
+ | words, a precision measure of 1 indicates no false positives, while a recall measure of
2181
+ | 1 indicates there are no false negatives.
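+ text | For reference, the two measures plotted in Figure 3.10 reduce to the usual definitions; the
+ | counts in the example below are made up.
+ |
+ | def precision_recall(true_positives, false_positives, false_negatives):
+ |     """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
+ |     precision = true_positives / (true_positives + false_positives)
+ |     recall = true_positives / (true_positives + false_negatives)
+ |     return precision, recall
+ |
+ | # e.g. 70 correct detections, 2 false alarms, 5 missed chairs:
+ | print(precision_recall(70, 2, 5))   # -> (0.972..., 0.933...)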
2182
+ | Figure 3.10 shows the corresponding precision-recall curves. The first two plots
2183
+ | show precision-recall curves using a similar pair of models, where the chairs have sim-
2184
+ | ilar dimensions, which is expected to result in high false-positive rates (see Figure 3.9,
2185
+ | left). Not surprisingly, recognition improves with a lower noise margin and/or higher
2186
+ | sampling density. Performance, however, saturates for Gaussian noise lower than
+ | 0.3 and density higher than 0.6, since both our model- and part-based components
+ | are approximations of the true data, resulting in an inherent discrepancy between
+ | the measurement and the model, even in the absence of noise. Note that as long as the parts
2190
+ | and dimensions are captured, our system still detects objects even under high noise
2191
+ | and sparse sampling.
2192
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 54
2193
+ blank |
2194
+ |
2195
+ |
2196
+ text | [Figure 3.10 plots: three precision-recall curves titled "Density (a similar pair)", "Noise (a similar pair)", and "Data type"; x-axis Precision, y-axis Recall; legends vary density (0.4-0.8), Gaussian noise (0.004-2.0), and similar vs. different model pairs]
2226
+ blank |
2227
+ text | Figure 3.10: Precision-recall curve with varying parameter λ.
2228
+ blank |
2229
+ text | Our algorithm is more robust when the pair of models is sufficiently
2230
+ | different (see Figure 3.10, right). We tested with two pairs of chairs (see Figure 3.9):
2231
+ | the first pair had chairs of similar dimensions as before (in solid lines), while the
2232
+ | second pair had a chair and a sofa with large geometric differences (in dotted lines).
2233
+ | When tested with the different pair, our system achieved precision higher than 0.98
2234
+ | for recall larger than 0.9. Thus, as long as the geometric space of the objects is sparsely
2235
+ | populated, our algorithm has a high accuracy in quickly acquiring the geometry of
2236
+ | the environment without assistance from data-driven or machine-learning techniques.
2237
+ blank |
2238
+ |
2239
+ title | 3.5.2 Real-World Scenes
2240
+ text | The more practical test of our system is its performance on real scanned data since
2241
+ | it is difficult to synthetically recreate all the artifacts encountered during scanning
2242
+ | of an actual physical space. We tested our framework on a range of real-world ex-
2243
+ | amples, each consisting of multiple objects arranged over large spaces (e.g., office
2244
+ | areas, seminar rooms, auditoriums) at a university. For both the learning and the
2245
+ | recognition phases, we acquired the scenes using a Microsoft Kinect scanner with an
2246
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 55
2247
+ blank |
2248
+ |
2249
+ text | scene        model       points per scan   no. of scans   no. of prim.   no. of joints
+ | synthetic1   chair       28445             7              10             4
+ | synthetic1   stool       19944             7              3              2
+ | synthetic1   monitor     60933             7              3              2
+ | synthetic2   chaira      720364            7              9              5
+ | synthetic2   chairb      852072            1              6              0
+ | synthetic3   chair       253548            4              10             2
+ | office       chair       41724             7              8              4
+ | office       monitor     20011             5              3              2
+ | office       trash bin   28348             2              4              0
+ | office       whitebrd.   356231            1              3              0
+ | auditorium   chair       31534             5              4              2
+ | seminar rm.  chair       141301            1              4              0
2266
+ blank |
2267
+ text | Table 3.2: Models obtained from the learning phase (see Figure 3.11).
2268
+ blank |
2269
+ text | open source scanning library [EEH+ 11]. The scenes were challenging, especially due
2270
+ | to the amount of variability in the individual model poses (see our project page for
2271
+ | the input scans and recovered models). Table 3.2 summarizes all the models built
2272
+ | during the learning stage for these scenes, ranging from 3-10 primitives with 0-5 joints
2273
+ | extracted from only a few scans (see Figure 3.11). While we evaluated our framework
2274
+ | based on the raw Kinect output rather than on processed data (e.g., [IKH+ 11]), the
2275
+ | performance limits should be similar when calibrated to the data quality and physical
2276
+ | size of the objects.
2277
+ blank |
2278
+ |
2279
+ |
2280
+ |
2281
+ text | Figure 3.11: Various models learned/used in our test (see Table 3.2).
2282
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 56
2283
+ blank |
2284
+ |
2285
+ |
2286
+ text | Our recognition phase was lightweight and fast, taking on average 200ms to com-
2287
+ | pare a point cluster to a model on a 2.4GHz CPU with 6GB RAM. For example, in
2288
+ | Figure 3.1, our system detected all 5 chairs present and 4 of the 5 monitors, along with
2289
+ | their poses. Note that objects that were not among the learned models remained un-
2290
+ | detected, including a sofa in the middle of the space and other miscellaneous clutter.
2291
+ | We overlaid the unresolved points on the recognized parts for comparison. Note that
2292
+ | our algorithm had access to only the geometry of objects, not any color or texture
2293
+ | attributes. The complexity of our problem setting can be appreciated by looking at
2294
+ | the input scan, which is difficult even for a human to parse visually. We observed
2295
+ | Kinect data to exhibit highly non-linear noise effects that were not simulated in our
2296
+ | synthetic scans; data also went missing when an object was narrow or specular (e.g.,
2297
+ | monitor), with flying pixels along depth discontinuities, and severe quantization noise
2298
+ | for distant objects.
2299
+ | scene      input points (ave. / min. / max.)   objects present   objects detected*
+ | syn. 1     3227 / 1168 / 9967                  5c 3s 5m          5c 3s 5m
+ | syn. 2     2422 / 1393 / 3427                  4ca 4cb           4ca 4cb
+ | syn. 3     1593 / 948 / 2704                   14 chairs         14 chairs
+ | teaser     6187 / 2575 / 12083                 5c 5m 0t          5c 4m 0t
+ | office 1   3452 / 1129 / 7825                  5c 2m 1t 2w       5c 2m 1t 2w
+ | office 2   3437 / 1355 / 10278                 8c 5m 0t 2w       6c 3m 0t 2w
+ | aud. 1     19033 / 11377 / 29260               26 chairs         26 chairs
+ | aud. 2     9381 / 2832 / 13317                 21 chairs         19 chairs
+ | sem. 1     4326 / 840 / 11829                  13 chairs         11 chairs
+ | sem. 2     6257 / 2056 / 12467                 18 chairs         16 chairs
+ | *c: chair, m: monitor, t: trash bin, w: whiteboard, s: stool
+ | Table 3.3: Statistics for the recognition phase. For each scene, we also indicate the
+ | corresponding scene in Figure 3.8 and Figure 3.12, when applicable.
2315
+ blank |
2316
+ text | Figure 3.12 compiles the results for cluttered office setups, auditoriums, and sem-
2317
+ | inar rooms. Although we tested with different scenes, we present only representative
2318
+ | examples as the performance on all types of scenes was comparable. Our system
2319
+ | detected the chairs, computer monitors, whiteboards, and trash bins across different
2320
+ | rooms, and the rows of auditorium chairs in different configurations. Our system
2321
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 57
2322
+ blank |
2323
+ |
2324
+ |
2325
+ text | missed some of the monitors because the material properties of the screens were proba-
2326
+ | bly not favorable to Kinect capture. The missed monitors (as in Figure 3.1 and office
2327
+ | #2 in Figure 3.12) have big rectangular holes within the screen in the scans. In office
2328
+ | #2, the system also missed two of the chairs that were mostly occluded and beyond
2329
+ | what our framework can handle.
2330
+ | Even under such demanding data quality, our system can recognize the models
2331
+ | and recover poses from data sets an order of magnitude sparser than those required
2332
+ | in the learning phase. Surprisingly, the system could also detect the small tables in
2333
+ | the two auditorium scenes (1 in auditorium #1, and 3 in auditorium #2) and also
2334
+ | identify pose changes in the auditorium seats. Figure 3.13 shows a close-up office
2335
+ | scene to better illustrate the deformation modes that our system captured. All of the
2336
+ | recognized object models have one or more deformation modes, and we can visually
2337
+ | compare the quality of data to the recovered pose and deformation.
2338
+ | The segmentation of real-world scenes is challenging with naturally cluttered
2339
+ | set-ups. The challenge is well demonstrated in the seminar rooms because of closely
2340
+ | spaced chairs or chairs leaning against the wall. In contrast to the auditorium scenes,
2341
+ | where the rows of chairs are detected together making the segmentation trivial, in
2342
+ | the seminar room setting chairs often occlude each other. The quality of data also
2343
+ | deteriorates because of thin metal legs with specular highlights. Nevertheless, our
2344
+ | system correctly recognized most of the chairs along with correct configurations by
2345
+ | first detecting the larger parts. Although only 4-6 chairs were detected in the initial
2346
+ | iteration, our system eventually detected most of the chairs in the seminar rooms by
2347
+ | refining the segmentation based on the learned geometry (in 3-4 iterations).
2348
+ blank |
2349
+ |
2350
+ title | 3.5.3 Comparisons
2351
+ text | In the learning phase, our system requires multiple scans of an object to build a proxy
2352
+ | model along with its deformation modes. Unfortunately, the existing public data sets
2353
+ | do not provide such multiple scans. Instead, we compared our recognition routine
2354
+ | to the algorithm proposed by Koppula et al. [KAJS11] using author-provided code
2355
+ | to recognize objects from a real-time stream of Kinect data after the user manually
2356
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 58
2357
+ blank |
2358
+ |
2359
+ |
2360
+ text | marks the ground plane. We fixed the device location and qualitatively compared
2361
+ | the recognition results of the two algorithms (see Figure 3.14). We observed that
2362
+ | Koppula et al. reliably detect floors, table tops and front-facing chairs, but often fail
2363
+ | to detect chairs facing backwards, or distant ones. They also miss all the monitors,
2364
+ | which usually are very noisy. In contrast, our algorithm, being pose- and variation-
+ | aware, is more stable across multiple frames, even with access to less information (we
2366
+ | do not use color). Note that while our system detected some monitors, their poses are
2367
+ | typically biased toward parts where measurements exist. In summary, for partial and
2368
+ | noisy point-clouds, the probabilistic formulation coupled with geometric reasoning
2369
+ | results in robust semantic labeling of the objects.
2370
+ blank |
2371
+ |
2372
+ title | 3.5.4 Limitations
2373
+ text | While in our tests the recognition results were mostly satisfactory (see Table 3.3),
2374
+ | we observed two main failure modes. First, our system failed to detect objects when
2375
+ | large amounts of data were missing. In real-world scenarios, our object scans could
2376
+ | easily exhibit large holes because of occlusions, specular materials, or thin structures.
2377
+ | Further, scans can be sparse and distorted for distant objects. Second, our system
2378
+ | cannot overcome the limitations of our initial segmentation. For example, if objects
2379
+ | are closer than θdist , our system groups them as a single object; while a single object
2380
+ | can be confused for multiple objects if its measurements are separated by more than
2381
+ | θdist from a particular viewpoint. Although in certain cases the algorithm can recover
2382
+ | segmentations with the help of other visible parts, this recovery becomes difficult
2383
+ | because our system allows objects to deform and hence have variable extent.
2384
+ | However, even with these limitations, our system overall reliably recognized scans
2385
+ | with 1000-3000 points per scan since in the learning phase the system extracted
2386
+ | the important degrees of variation, thus providing a compact, yet powerful, model
2387
+ | (and deformation) abstraction. In a real office setting, the simplicity and speed
2388
+ | of our framework would allow a human operator to immediately notice missed or
2389
+ | misclassified objects and quickly re-scan those areas under more favorable conditions.
2390
+ | We believe that such a progressive scanning workflow will become more commonplace
2391
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 59
2392
+ blank |
2393
+ |
2394
+ |
2395
+ text | in future acquisition setups.
2396
+ blank |
2397
+ |
2398
+ title | 3.5.5 Applications
2399
+ text | Our results suggest that our system is also useful for obtaining a high-level under-
2400
+ | standing of recognized objects, e.g., relative position, orientation, frequency of learned
2401
+ | objects. Specifically, as our system progressively scans multiple rooms populated with
2402
+ | the same objects, the system gathers valuable co-occurrence statistics (see Table 3.4).
2403
+ | For example, from the collected data, the system learns that the orientations of audi-
+ | torium chairs are consistent (i.e., they face a single direction), or observes a pattern in
+ | the relative orientation between a chair and its neighboring monitor. Not surprisingly,
2406
+ | our system found chairs to be more frequent in seminar rooms than in offices.
2407
+ | In the future, we plan to incorporate such information to handle cluttered datasets
2408
+ | while scanning similar environments but with differently shaped objects.
2409
+ blank |
2410
+ text | scene    relationship     distance mean (m)   distance std (m)   angle mean (°)   angle std (°)
+ | office   chair-chair      1.207               0.555              78.7             74.4
+ | office   chair-monitor    0.943               0.164              152              39.4
+ | aud.     chair-chair      0.548               0                  0                0
+ | sem.     chair-chair      0.859               0.292              34.1             47.4
2418
+ blank |
2419
+ text | Table 3.4: Statistics between objects learned for each scene category.
2420
+ blank |
2421
+ text | As an exciting possibility, the system can efficiently detect change. By change, we
2422
+ | mean the introduction of a new object not previously seen in the learning phase, while
2423
+ | factoring out variations due to different spatial arrangements or changes in individual
2424
+ | model poses. For example, in auditorium #2, a previously unobserved chair
2425
+ | is successfully detected (highlighted in yellow). Such a mode is particularly useful
2426
+ | for surveillance and automated investigation of indoor environments, or for disaster
2427
+ | planning in environments that are unsafe for humans to enter.
2428
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 60
2429
+ blank |
2430
+ |
2431
+ |
2432
+ title | 3.6 Conclusions
2433
+ text | We have presented a simple system for recognizing man-made objects in cluttered 3D
2434
+ | indoor environments, while factoring out low-dimensional deformations and pose vari-
2435
+ | ations, on a scale previously not demonstrated. Our pipeline can be easily extended
2436
+ | to more complex environments primarily requiring reliable acquisition of additional
2437
+ | object models and their variability modes.
2438
+ | Several future challenges and opportunities remain: (i) With an increasing number
2439
+ | of object prototypes, the system will need more sophisticated search data structures
2440
+ | in the recognition phase. We hope to benefit from recent advances in shape search.
2441
+ | (ii) We have focused on a severely restricted form of sensor input, namely, poor and
2442
+ | sparse geometry alone. We intentionally left out color and texture, which can be quite
2443
+ | beneficial, especially if appearance variations can be accounted for. (iii) A natural
2444
+ | extension would be to take the recognized models along with their pose and joint
2445
+ | attributes to create data-driven, high-quality interior CAD models for visualization,
2446
+ | or more schematic representations, that may be sufficient for indoor navigation, or
2447
+ | simply for scene understanding (see Figure 3.1, rightmost image, and recent efforts
2448
+ | in scene modeling [NXS12, SXZ+ 12]).
2449
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 61
2450
+ blank |
2451
+ |
2452
+ |
2453
+ text | [Figure 3.12 image: scenes "office 1" (chair, monitor, desk), "office 2" (trash bin, whiteboard), "auditorium 1", "auditorium 2" (open tables, change detection, open seat), "seminar room 1", "seminar room 2" (missed chairs)]
2486
+ blank |
2487
+ |
2488
+ |
2489
+ |
2490
+ text | Figure 3.12: Recognition results on various office and auditorium scenes. Since the
2491
+ | input scans have limited viewpoints and thus are too poor to provide a clear represen-
2492
+ | tation of the scene complexity, we include scene images for visualization (these were
2493
+ | unavailable to the algorithm). Note that for the auditorium examples, our system
2494
+ | even detected the small tables attached to the chairs — this was possible since the
2495
+ | system extracted this variation mode in the learning phase.
2496
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 62
2497
+ blank |
2498
+ |
2499
+ |
2500
+ |
2501
+ text | [Figure 3.13 image: labels "missed monitor", "laptop", "monitor", "chair", "drawer deformations"]
2508
+ blank |
2509
+ text | Figure 3.13: A close-up office scene. All of the recognized objects have one or more
2510
+ | deformation modes. The algorithm inferred the angles of the laptop screen and the
2511
+ | chair back, and the heights of the chair seat, the arm rests, and the monitor. Note that our
2512
+ | system also captured the deformation modes of open drawers.
2513
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 63
2514
+ blank |
2515
+ |
2516
+ |
2517
+ |
2518
+ text | [Figure 3.14 image: two input scenes, each shown for "[Koppula et al.]" and "ours"; annotations "shifted", "wrong labels", "missed"; legend: table top, wall, floor, chair base, table leg, monitor, chair back]
2535
+ blank |
2536
+ text | Figure 3.14: We compared our algorithm and Koppula et al. [KAJS11] using multiple
2537
+ | frames of scans from the same viewpoint. Our recognition results are more stable
2538
+ | across different frames.
2539
+ meta | Chapter 4
2540
+ blank |
2541
+ title | Guided Real-Time Scanning of
2542
+ | Indoor Objects3
2543
+ blank |
2544
+ text | Acquiring 3-D models of indoor environments is a critical component for under-
+ | standing and mapping these environments. For successful 3-D acquisition in indoor
2546
+ | scenes, it is necessary to simultaneously scan the environment, interpret the incom-
2547
+ | ing data stream, and plan subsequent data acquisition, all in a real-time fashion. The
2548
+ | challenge is, however, that individual frames from portable commercial 3-D scanners
2549
+ | (RGB-D cameras) can be of poor quality. Typically, complex scenes can only be
2550
+ | acquired by accumulating multiple scans. Information integration is done in a post-
2551
+ | scanning phase, when such scans are registered and merged, leading eventually to
2552
+ | useful models of the environment. Such a workflow, however, is limited by the fact
2553
+ | that poorly scanned or missing regions are only identified after the scanning process
2554
+ | is finished, when it may be costly to revisit the environment being acquired to per-
2555
+ | form additional scans. In the study presented in this chapter, we focused on real-time
2556
+ | 3D model quality assessment and data understanding that could provide immediate
2557
+ | feedback for guidance in subsequent acquisition.
2558
+ | Evaluating acquisition quality without having any prior knowledge about an un-
2559
+ | known environment, however, is an ill-posed problem. We observe that although the
2560
+ meta | 3
2561
+ text | The contents of the chapter will be published as Y.M. Kim, N. Mitra, Q. Huang, L. Guibas,
2562
+ | Guided Real-Time Scanning of Indoor Environments, Pacific Graphics 2013.
2563
+ blank |
2564
+ |
2565
+ |
2566
+ meta | 64
2567
+ | CHAPTER 4. GUIDED REAL-TIME SCANNING 65
2568
+ blank |
2569
+ |
2570
+ |
2571
+ text | target scene itself may be unknown, in many cases the scene consists of objects from
2572
+ | a well-prescribed pre-defined set of object categories. Moreover, these categories are
2573
+ | well represented in publicly available 3-D shape repositories (e.g., Trimble 3D Ware-
2574
+ | house). For example, an office setting typically consists of various tables, chairs,
2575
+ | monitors, etc., all of which have thousands of instances in the Trimble 3D Ware-
2576
+ | house. In our approach, instead of attempting to reconstruct detailed 3D geometry
2577
+ | from low-quality inconsistent 3D measurements, we focus on parsing the input scans
2578
+ | into simpler geometric entities, and use existing 3D model repositories like Trimble
2579
+ | 3D warehouse as proxies to assist the process of assessing data quality. Thus, we
2580
+ | defined two key tasks that an effective acquisition method would need to complete:
2581
+ | (i) given a partially scanned object, reliably and efficiently retrieve appropriate proxy
2582
+ blank |
2583
+ |
2584
+ |
2585
+ |
2586
+ text | Figure 4.1: We introduce a real-time guided scanning system. As streaming 3D
2587
+ | data is progressively accumulated (top), the system retrieves the top matching mod-
2588
+ | els (bottom) along with their pose to act as geometric proxies to assess the current
2589
+ | scan quality, and provide guidance for subsequent acquisition frames. Only a few
2590
+ | intermediate frames with corresponding retrieved models are shown in this figure.
2591
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 66
2592
+ blank |
2593
+ |
2594
+ |
2595
+ text | models of it from the database; and (ii) position the retrieved models in the scene
2596
+ | and provide real-time feedback (e.g., missing geometry that still needs to be scanned)
2597
+ | to guide subsequent data gathering.
2598
+ | We introduce a novel partial shape retrieval approach for finding similar shapes
2599
+ | of a query partial scan. In our setting, we used the Microsoft Kinect to acquire
2600
+ | the scans of real objects. The proposed approach, which combines both descriptor-
2601
+ | based retrieval and registration-based verification, is able to search in a database of
2602
+ | thousands of models in real-time. To account for partial similarity between the input
2603
+ | scan and the models in a database, we created simulated scans of each database model
2604
+ | and compared a scan of the real setting to a scan of the simulated setting. This allowed us to
2605
+ | efficiently compare shapes using global descriptors even in the presence of only partial
2606
+ | similarity; and the approach remains robust in the case of occlusions or missing data
2607
+ | about the object being scanned.
2608
+ | Once our system finds a match, to mark out missing parts in the current merged
2609
+ | scan, the system aligns it with the retrieved model and highlights the missing part
2610
+ | or places where the scan density is low. This visual feedback allows the operator
2611
+ | to quickly adjust the scanning device for subsequent scans. In effect, our 3D model
2612
+ | database and matching algorithms make it possible for the operator to assess the
2613
+ | quality of the data being acquired and discover badly scanned or missing areas while
2614
+ | the scan is being performed, thus allowing corrective actions to be taken immediately.
2615
+ | We extensively evaluated the robustness and accuracy of our system using syn-
2616
+ | thetic data sets with available ground truth. Further, we tested our system on physical
2617
+ | environments to achieve real-time scene understanding (see the supplementary video
2618
+ | that includes the actual scanning session recorded). In summary, in this chapter, we
2619
+ | present a novel guided scanning interface and introduce a relation-based light-weight
2620
+ | descriptor for fast and accurate model retrieval and positioning to provide real-time
2621
+ | guidance for scanning.
2622
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 67
2623
+ blank |
2624
+ |
2625
+ |
2626
+ title | 4.1 Related Work
2627
+ blank |
2628
+ title | 4.1.1 Interactive Acquisition
2629
+ text | Fast, accurate, and autonomous model acquisition has long been a primary goal in
2630
+ | robotics, computer graphics, and computer vision. With the introduction of afford-
2631
+ | able, portable, commercial RGBD cameras, there has been a pressing need to simplify
2632
+ | scene acquisition workflows to allow less experienced individuals to acquire scene ge-
2633
+ | ometries. Recent efforts fall into two broad categories: (i) combining individual
2634
+ | frames of low-quality point-cloud data with SLAM algorithms [EEH+ 11, HKH+ 12] to
2635
+ | improve scan quality [IKH+ 11]; (ii) using supervised learning to train classifiers for
2636
+ | scene labeling [RBF12] with applications to robotics [KAJS11]. Previously, [RHHL02]
2637
+ | aggregated scans at interactive rates to provide visual feedback to the user. This work
2638
+ | was recently expanded by [DHR+ 11]. [KDS+ 12] extracted simple planes and recon-
2639
+ | structed floor plans with guidance from a projector pattern. While our goal is also to
2640
+ | provide real-time feedback, our system differs from previous efforts in that it uses
2641
+ | retrieved proxy models to automatically assess the current scan quality, enabling
2642
+ | guided scanning.
2643
+ blank |
2644
+ |
2645
+ title | 4.1.2 Scan Completion
2646
+ text | Various strategies have been proposed to improve noisy scans or plausibly fill in miss-
2647
+ | ing data due to occlusion: researchers have exploited repetition [PMW+ 08], symme-
2648
+ | try [TW05, MPWC12], or used primitives to complete missing parts [SWK07]. Other
2649
+ | approaches have focused on using geometric proxies and abstractions including curves,
2650
+ | skeletons, planar abstractions, etc. In the context of image understanding, indoor
2651
+ | scenes have been abstracted and modeled as a collection of simple cuboids [LGHK10,
2652
+ | ZCC+ 12] to capture a variety of man-made objects.
2653
+ blank |
2654
+ |
2655
+ title | 4.1.3 Part-Based Modeling
2656
+ text | Simple geometric primitives, however, are not always sufficiently expressive for com-
2657
+ | plex shapes. Meanwhile, such objects can still be split into simpler parts that aid
2658
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 68
2659
+ blank |
2660
+ |
2661
+ |
2662
+ text | shape understanding. For example, parts can act as entities for discovering rep-
2663
+ | etitions [TSS10], training classifiers [SFC+ 11, XS12], or facilitating shape synthe-
2664
+ | sis [JTRS12]. Alternately, a database of part-based 3D model templates can be used
2665
+ | to detect shapes from incomplete data [SXZ+ 12, NXS12, KMYG12]. Such methods
2666
+ | often rely on expensive matching, and thus do not lend themselves to low-memory
2667
+ | footprint real-time realizations.
2668
+ blank |
2669
+ |
2670
+ title | 4.1.4 Template-Based Completion
2671
+ text | Our system also uses a database of 3D models (e.g., chairs, lamps, tables) to retrieve
+ | shapes from 3D scans. However, by defining a novel, simple descriptor, our sys-
2673
+ | tem, compared to previous efforts, can reliably handle much larger model databases.
2674
+ | Specifically, instead of geometrically matching templates [HCI+ 11], or using templates
2675
+ | to complete missing parts [PMG+ 05], our system initially searches for consistency in
2676
+ | the distribution of relations among primitive faces.
2677
+ blank |
2678
+ |
2679
+ title | 4.1.5 Shape Descriptors
2680
+ text | In the context of shape retrieval, various descriptors have been investigated for group-
2681
+ | ing, classification, or retrieval of 3D geometry. For example, the method proposed by
2682
+ | [CTSO03] uses light-field descriptors based on silhouettes, the method by [OFCD02]
2683
+ | uses shape distributions to categorize different object classes, etc. The silhouette
2684
+ | method requires an expensive rotational alignment search, limiting its usefulness in
2685
+ | our setting to a small number of models (100-200). Both methods assume access
2686
+ | to nearly complete models to match against. In contrast, for guided scanning, our
2687
+ | approach can support much larger model sets (about 2000 models) and, more impor-
2688
+ | tantly, focus on handling poor and incomplete point sets as inputs to the matcher.
2689
+ blank |
2690
+ |
2691
+ title | 4.2 Overview
2692
+ text | Figure 4.2 illustrates the pipeline of our guided real-time scanning system, which con-
2693
+ | sists of a scanning device (Kinect in our case) and a database of 3D shapes containing
2694
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 69
2695
+ blank |
2696
+ |
2697
+ |
2698
+ |
2699
+ text | [Figure 4.2 diagram: an off-line process computes simulated scans, A2h descriptors, and a similarity measure from the database of 3D models; at run time, frames of measurements are registered into a segmented pointcloud, its A2h descriptor and density voxels are computed, a shape is retrieved and aligned, and the retrieved model with its pose provides guidance]
2736
+ blank |
2737
+ |
2738
+ text | Figure 4.2: Pipeline of the real-time guided scanning framework.
2739
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 70
2740
+ blank |
2741
+ |
2742
+ |
2743
+ text | the categories of the shapes present in the environment. In each iteration, the sys-
2744
+ | tem performs three tasks: (i) scan acquisition from a set of viewpoints specified by a
2745
+ | user (or a planning algorithm); (ii) shape retrieval using a distribution of relations; and
2746
+ | (iii) comparison of the scanned pointset with the best retrieved model. The system
2747
+ | iterates these steps until a sufficiently good match is found (see supplementary video).
2748
+ | The challenge is how to maintain real-time response.
2749
+ blank |
2750
+ |
2751
+ title | 4.2.1 Scan Acquisition
2752
+ text | The input stream of a real-time depth sensor (in our case, the Kinect was used) is col-
2753
+ | lected and processed using an open-source implementation [EEH+ 11] that calibrates
2754
+ | the color and depth measurements and outputs the pointcloud data. The color fea-
2755
+ | tures of individual frames are then extracted and matched from consecutive frames.
2756
+ | The corresponding depth values are used to incrementally register the depth mea-
2757
+ | surements [HKH+ 12]. The pointcloud that belongs to the object is segmented as the
+ | system detects the ground plane and excludes the points that belong to the plane. We
+ | will refer to the segmented, registered set of depth measurements as a merged scan.
+ | Whenever a new frame is processed, the system calculates the descriptor and the
+ | density voxels from the pointcloud data for the merged scan.
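The ground-plane step above can be illustrated with a small RANSAC-style plane fit followed by removal of the plane inliers. This is only a minimal sketch assuming numpy point arrays; the function name, distance threshold, and iteration count are hypothetical and not taken from the system described here.

```python
import numpy as np

def segment_ground_plane(points, dist_thresh=0.02, iters=200, seed=0):
    """Detect a dominant plane by a simple RANSAC loop and split the cloud.

    points: (N, 3) array of 3D points from the registered scan.
    Returns (object_points, plane_inlier_mask).
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate triplet, skip
            continue
        n = n / norm
        dist = np.abs((points - p0) @ n)     # point-to-plane distances
        mask = dist < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # Points off the dominant plane are treated as the scanned object.
    return points[~best_mask], best_mask
```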
2762
+ blank |
2763
+ |
2764
+ title | 4.2.2 Shape Retrieval
2765
+ text | Our goal is to find shapes in the database that are similar to the merged scan. Since
2766
+ | the merged scan may contain only partial information about the object being scanned,
2767
+ | our system internally generates simulated views of both the merged scan as well as
2768
+ | shapes in the database, and then compares the point clouds associated with these
2769
+ | views. The key observation is that although the merged scan may still have missing
2770
+ | geometry, it is likely that it contains all the visible geometry of the object being
2771
+ | scanned when the object is viewed from a particular point of view (i.e., the self-
2772
+ | occlusions are predictable); it thus becomes comparable to database model views
2773
+ | from the same or nearby viewpoints. Hence, the system measures shape similarity
2774
+ | between such point-cloud views. For shape retrieval, our system first performs a
2775
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 71
2776
+ blank |
2777
+ |
2778
+ |
2779
+ text | descriptor-based similarity search against the entire database to obtain a candidate
2780
+ | set of similar models. Finally, the system performs registration of each model with
2781
+ | the merged scan and returns the model with the best alignment score.
2782
+ | We note here that past research on global shape descriptors has mostly focused on
2783
+ | broad differentiation of shape classes, e.g., separating shapes of vehicles from those
2784
+ | of furniture or of people, etc. In our case, since the system is looking for potentially
2785
+ | modest amounts of missing geometry in the scans, we aim more for fine variability
2786
+ | differentiation among a particular object class, such as chairs. We have therefore
2787
+ | developed and exploited a novel histogram descriptor based on the angles between
2788
+ | the shape normals for this task (see Section 4.3.2).
2789
+ blank |
2790
+ |
2791
+ title | 4.2.3 Scan Evaluation
2792
+ text | Once the retrieval is complete, the retrieved proxy is displayed to the user.
+ | The system also highlights voxels with missing data when compared with the best
+ | matching model, and finishes when the retrieved best-match model is close enough to
+ | the current measurement (i.e., when the missing voxels are less than 1% of the total number
2796
+ | of voxels). In Section 4.3.4, we elaborate on this guided scanning interface.
2797
+ blank |
2798
+ |
2799
+ title | 4.3 Partial Shape Retrieval
2800
+ text | Our goal is to quickly assess the quality of the current scan and guide the user in
2801
+ | subsequent scans. This is challenging on the following counts: (i) the system has
2802
+ | to assess model quality without necessarily knowing which model is being scanned;
2803
+ | (ii) the scans are potentially incomplete, with large parts of data missing; and (iii) the
2804
+ | system should respond in real-time.
2805
+ | We observe that existing database models such as Trimble 3D Warehouse models
2806
+ | can be used as proxies for evaluating scan quality of similar objects being scanned,
2807
+ | thus addressing the first challenge. Hence, for any merged query scan (i.e., point-
2808
+ | cloud) S, the system looks for a match among similar models in the database M =
2809
+ | {M1, · · · , MN}. For simplicity, we assume that the up-right orientation of each model
2810
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 72
2811
+ blank |
2812
+ |
2813
+ |
2814
+ text | in the model database is available.
2815
+ | To handle the second challenge, we note that missing data, even in large chunks,
2816
+ | are mostly the result of self occlusion, and hence are predictable. To address this
2817
+ | problem, our system synthetically scans the models Mi from different viewpoints to
2818
+ | simulate such self occlusions. This greatly simplifies the problem by allowing us to
2819
+ | directly compare S to the simulated scans of Mi , thus automatically accounting for
2820
+ | missing data in S.
2821
+ | Finally, to achieve real-time performance, we propose a simple, robust, yet effective
2822
+ | descriptor to match S to view-dependent scans of Mi . Subsequently, the system
2823
+ | performs registration to verify the match between each matched simulated scan and
2824
+ | the query scan, and returns the most similar simulated scan and the corresponding
2825
+ | model Mi. The following subsections provide further details of each step for
2826
+ | partial shape retrieval.
2827
+ blank |
2828
+ |
2829
+ title | 4.3.1 View-Dependent Simulated Scans
2830
+ text | For each model Mi, the system generates simulated scans S^k(Mi) from multiple camera
+ | positions. Let dup denote the up-right orientation for model Mi. Our system takes
+ | dup as the z-axis and arbitrarily fixes any orthogonal direction di (i.e., di^T dup = 0) as
+ | the x-axis. The system also translates the centroid of Mi to the origin.
2834
+ | The system then virtually positions the cameras at the surface of a view-sphere
2835
+ | around the origin. Specifically, the camera is placed at
2836
+ blank |
2837
+ text | ci := (2d cos θ sin φ, 2d sin θ sin φ, 2d cos φ)
2838
+ blank |
2839
+ text | where d denotes the length of the diagonal of the bounding box of Mi , and φ denotes
2840
+ | the camera altitude. The camera up-vector is defined as
2841
+ blank |
2842
+ text | ui := (dup − ⟨dup, ĉi⟩ ĉi) / ‖dup − ⟨dup, ĉi⟩ ĉi‖,   where ĉi := ci / ‖ci‖,
2845
+ blank |
2846
+ text | and the gaze point is defined as the origin. The fields of view are set to π/2 in both
2847
+ | the up and horizontal directions.
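The camera placement just described can be written down directly from the formulas above. The following is a small illustrative sketch (not the authors' code), assuming numpy and that the model has already been centered with its up-right direction along z; the function name is hypothetical.

```python
import numpy as np

def view_sphere_cameras(d, K=6, phis=(np.pi / 6, np.pi / 3)):
    """Camera centers and up-vectors on the view sphere (sketch of Sec. 4.3.1).

    d: diagonal length of the model's bounding box.
    Returns a list of (center, up, gaze) tuples; the gaze point is the origin.
    """
    d_up = np.array([0.0, 0.0, 1.0])          # up-right direction taken as the z-axis
    cams = []
    for phi in phis:
        for k in range(K):
            theta = 2.0 * np.pi * k / K
            c = 2.0 * d * np.array([np.cos(theta) * np.sin(phi),
                                    np.sin(theta) * np.sin(phi),
                                    np.cos(phi)])
            c_hat = c / np.linalg.norm(c)
            u = d_up - np.dot(d_up, c_hat) * c_hat   # project out the viewing direction
            u = u / np.linalg.norm(u)
            cams.append((c, u, np.zeros(3)))
    return cams
```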
2848
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 73
2849
+ blank |
2850
+ |
2851
+ |
2852
+ text | For each such camera location, our system obtains a synthetic scan using the z-
2853
+ | buffer with a grid setting of 200 × 200. Such a grid results in vertices where the grid
2854
+ | rays intersect the model. The system generates the simulated scan by computing one
2855
+ | surfel (pf , nf , df ) (i.e., a point, normal, and density, respectively) from each quad
+ | face f = (qf1 , qf2 , qf3 , qf4 ), as follows:
2857
+ blank |
2858
+ text | pf := (1/4) Σ_{i=1..4} qfi ,        nf := (1/4) Σ_{ijk ∈ {123,234,341,412}} nijk ,        (4.1)
+ |
+ | df := 1 / Σ_{ijk ∈ {123,234,341,412}} area(qfi , qfj , qfk )        (4.2)
2865
+ blank |
2866
+ |
2867
+ text | where nijk denotes the normal of the triangular face (qfi , qfj , qfk ) and nf ← nf /‖nf ‖.
+ | Thus the simulated scan simply collects the surfels generated from all the quad faces of
+ | the sampling grid.
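Equations (4.1) and (4.2) map each grid quad to a single surfel. A minimal per-quad sketch, assuming numpy and a hypothetical helper name, could look as follows.

```python
import numpy as np

def quad_surfel(q):
    """Surfel (point, normal, density) for one grid quad, after Eqs. (4.1)-(4.2).

    q: (4, 3) array with the quad vertices q_f1..q_f4 in order.
    """
    tris = [(0, 1, 2), (1, 2, 3), (2, 3, 0), (3, 0, 1)]   # index sets {123, 234, 341, 412}
    p_f = q.mean(axis=0)                                   # average of the four vertices
    normals, areas = [], []
    for i, j, k in tris:
        cr = np.cross(q[j] - q[i], q[k] - q[i])
        areas.append(0.5 * np.linalg.norm(cr))             # triangle area
        normals.append(cr / (np.linalg.norm(cr) + 1e-12))  # triangle normal
    n_f = np.mean(normals, axis=0)
    n_f = n_f / (np.linalg.norm(n_f) + 1e-12)              # renormalize: n_f <- n_f / ||n_f||
    d_f = 1.0 / (sum(areas) + 1e-12)                       # inverse of the summed areas
    return p_f, n_f, d_f
```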
2870
+ | Our system places K samples of θ, i.e., θ = 2kπ/K where k ∈ [0, K), and φ ∈
+ | {π/6, π/3} to obtain view-dependent simulated scans for each model Mi. Empirically,
2872
+ | we set K = 6 to balance between efficiency and quality when comparing simulated
2873
+ | scans and the merged scan S.
2874
+ blank |
2875
+ |
2876
+ title | 4.3.2 A2h Scan Descriptor
2877
+ text | Our goal is to design a descriptor that (i) is efficient to compute, (ii) is robust to
2878
+ | noise and outliers, and (iii) has a low-memory footprint. We draw inspiration from
2879
+ | shape distributions [OFCD02], which compute statistics about geometric quantities
2880
+ | that are invariant to global transforms, e.g., distances between pairs of points on
2881
+ | the models. Shape distribution descriptors, however, were designed to be resilient to
2882
+ | local geometric changes. Hence, they are ineffective in our setting, where shapes are
2883
+ | distinguished by subtle local features. Instead, our system computes the distributions
2884
+ | of angles between point normals, which better capture the local geometric features.
2885
+ | Further, since the system knows the upright direction of each shape, this information
2886
+ | is incorporated into the design of the descriptor.
2887
+ | Specifically, for each scan S (real or simulated), our system first allocates the
2888
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 74
2889
+ blank |
2890
+ |
2891
+ |
2892
+ text | points into three bins based on their height along the z-axis, i.e., the up-right direction.
2893
+ | Then, among the points within each bin, the system computes the distribution of
2894
+ | angles between normals of all pairs of points. The angle space is discretized using 50
2895
+ | bins over [0, π], i.e., each bin counts the frequency of normal angles falling within its
+ | range. We call this the A2h scan descriptor, which for each point cloud is a 50 × 3 = 150
2897
+ | dimensional vector; this collects the angle distribution within each height bin.
2898
+ | In practice, for pointclouds belonging to any merged scan, our system randomly
2899
+ | samples 10,000 pairs of points within each height bin to speed up the computation. In
2900
+ | our extensive tests, we found this simple descriptor to perform better than distance-
2901
+ | only histograms in distinguishing fine variability within a broad shape class (see
2902
+ | Figure 4.3).
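The A2h construction described above (three height bins along the up-right axis, 50 angle bins over [0, π], 10,000 sampled pairs per bin) can be sketched as below. This is an illustrative reading of the text, not the system's implementation: it assumes numpy, uses uniform pair sampling for simplicity (the density-aware variant is discussed in Section 4.5.2), and the function name and equal-width height-bin edges are assumptions.

```python
import numpy as np

def a2h_descriptor(points, normals, n_angle_bins=50, n_height_bins=3,
                   n_pairs=10_000, seed=0):
    """A2h histogram sketch (Sec. 4.3.2): angle-between-normals distributions
    collected separately in three height bins along the up-right (z) axis.

    points, normals: (N, 3) arrays; normals assumed to have unit length.
    Returns a vector of length n_angle_bins * n_height_bins (150 here).
    """
    rng = np.random.default_rng(seed)
    z = points[:, 2]
    edges = np.linspace(z.min(), z.max() + 1e-9, n_height_bins + 1)
    desc = []
    for b in range(n_height_bins):
        idx = np.where((z >= edges[b]) & (z < edges[b + 1]))[0]
        hist = np.zeros(n_angle_bins)
        if len(idx) >= 2:
            i = rng.choice(idx, n_pairs)
            j = rng.choice(idx, n_pairs)
            cos = np.clip(np.einsum('ij,ij->i', normals[i], normals[j]), -1.0, 1.0)
            ang = np.arccos(cos)                      # pairwise normal angles in [0, pi]
            hist, _ = np.histogram(ang, bins=n_angle_bins, range=(0.0, np.pi))
            hist = hist / hist.sum()                  # normalize to a distribution
        desc.append(hist)
    return np.concatenate(desc)
```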
2903
+ blank |
2904
+ |
2905
+ title | 4.3.3 Descriptor-Based Shape Matching
2906
+ text | A straightforward way to compare two descriptor vectors f1 and f2 is to take the Lp
+ | norm of their difference vector f1 − f2. However, the Lp norm can be sensitive to
+ | noise and does not account for the similarity between nearby bins of the two distributions.
2909
+ | Instead, our system uses the Earth Mover’s distance (EMD) to compare a pair of
2910
+ | distributions [RTG98]. Intuitively, given two distributions, one distribution can be
2911
+ | seen as a mass of earth properly spread in space, the other distribution as a collection
2912
+ | of holes that need to be filled with that earth. Then, the EMD measures the least
2913
+ | amount of work needed to fill the holes with earth. Here, a unit of work corresponds to
2914
+ | transporting a unit of earth by a unit of ground distance. The costs of “moving earth”
2915
+ | reflect the notion of nearness between bins; therefore the distortion due to noise is
2916
+ | minimized. In a 1D setting, EMD with L1 norms is equivalent to calculating an L1
2917
+ | norm for cumulative distribution functions (CDF) of the distribution [Vil03]. Hence,
2918
+ | our system achieves robustness to noise at the same time complexity as calculating
2919
+ | an L1 norm between the A2h distributions. For all of the results presented below, our
2920
+ | system used EMD with L1 norms of the CDFs computed from the A2h distributions.
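Because the A2h histograms are 1D distributions per height bin, the EMD with an L1 ground distance reduces to the L1 norm between CDFs, as noted above. The sketch below assumes numpy; combining the three height bins by summing their per-bin EMDs is an assumption made here for illustration, since the text does not spell out how the bins are aggregated.

```python
import numpy as np

def emd_1d(hist_a, hist_b):
    """1D Earth Mover's Distance between two histograms via their CDFs.

    For 1D distributions with an L1 ground distance, EMD equals the L1 norm
    of the difference of the cumulative distribution functions [Vil03].
    """
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a = a / max(a.sum(), 1e-12)
    b = b / max(b.sum(), 1e-12)
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

def a2h_distance(desc_a, desc_b, n_height_bins=3):
    """Compare two A2h descriptors by summing the per-height-bin 1D EMDs."""
    chunks_a = np.split(np.asarray(desc_a, dtype=float), n_height_bins)
    chunks_b = np.split(np.asarray(desc_b, dtype=float), n_height_bins)
    return sum(emd_1d(a, b) for a, b in zip(chunks_a, chunks_b))
```

A retrieval loop would then score the query descriptor against the descriptors of the 2K simulated scans of each model and keep the models with the best view scores.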
2921
+ | Because there are 2K view-dependent pointclouds associated with each model Mi ,
2922
+ | the system matches the query S with each such pointcloud S^k(Mi) (k = 1, 2, ..., 2K)
2923
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 75
2924
+ blank |
2925
+ |
2926
+ |
2927
+ text | and records the best matching score. In the end, the system returns the top 25
2928
+ | matches across the models in M.
2929
+ blank |
2930
+ |
2931
+ title | 4.3.4 Scan Registration
2932
+ text | Our system overlays the retrieved model Mi over the merged scan S as follows: the system
+ | first aligns the centroid of the simulated scan S^k(Mi) to match the centroid of S (note
2934
+ | that we do not force the model Mi to touch the ground), while scaling model Mi to
2935
+ | match the data. To fix the remaining 1DOF rotational ambiguity, the angle space is
2936
+ | discretized into 10◦ intervals, and the system picks the angle for which the rotated
2937
+ | model best matches the scan S. In practice, we found this refinement step necessary
2938
+ | since our view-dependent scans have coarse angular resolution (K = 6).
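The 1-DOF rotation search can be sketched as a brute-force sweep over 10-degree increments about the up axis. The alignment score used here (mean nearest-neighbor distance, computed with a dense distance matrix purely for clarity) is an assumption for illustration; the thesis does not specify the exact matching score, and the function name is hypothetical.

```python
import numpy as np

def best_z_rotation(model_pts, scan_pts, step_deg=10):
    """Resolve the remaining 1-DOF rotational ambiguity (sketch of Sec. 4.3.4).

    Tries rotations about the up (z) axis in fixed increments and keeps the one
    whose rotated model points lie closest to the scan.
    Returns (best_angle_in_radians, best_cost).
    """
    best_angle, best_cost = 0.0, np.inf
    for deg in range(0, 360, step_deg):
        t = np.deg2rad(deg)
        R = np.array([[np.cos(t), -np.sin(t), 0.0],
                      [np.sin(t),  np.cos(t), 0.0],
                      [0.0,        0.0,       1.0]])
        rotated = model_pts @ R.T
        # mean distance from each rotated model point to its nearest scan point
        d = np.linalg.norm(rotated[:, None, :] - scan_pts[None, :, :], axis=2)
        cost = d.min(axis=1).mean()
        if cost < best_cost:
            best_cost, best_angle = cost, t
    return best_angle, best_cost
```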
2939
+ | Finally, the system uses the positioned proxy model Mi to assess the quality of the
+ | current scan. Specifically, the bounding box of Mi is discretized into 9 × 9 × 9 voxels
+ | and the density of points falling within each voxel is calculated. Voxels where the
+ | matched model has a high density of points (above the average) but the scan S contributes
+ | insufficient points are highlighted, thus providing guidance for subsequent acquisitions.
+ | The process terminates when there are fewer than 10 such highlighted voxels, and the
+ | best matching model is simply displayed.
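The voxel-based guidance can be sketched as two 9 × 9 × 9 occupancy counts over the model's bounding box. Flagging a voxel when the model's count is above its own average while the scan contributes no points there is an assumption made for this sketch; the text only says "insufficient points". The function name and numeric guards are hypothetical.

```python
import numpy as np

def missing_voxels(model_pts, scan_pts, grid=9):
    """Highlight voxels that the proxy model fills but the scan does not (sketch).

    Both point sets are assumed to be aligned already. Counts are taken inside
    the model's bounding box; a voxel is flagged when the model density is above
    its own average but the scan has no points there.
    """
    lo = model_pts.min(axis=0)
    hi = model_pts.max(axis=0)
    span = np.maximum(hi - lo, 1e-9)

    def counts(pts):
        idx = np.clip(((pts - lo) / span * grid).astype(int), 0, grid - 1)
        c = np.zeros((grid, grid, grid), dtype=int)
        np.add.at(c, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
        return c

    model_c = counts(model_pts)
    scan_c = counts(scan_pts)
    flagged = (model_c > model_c.mean()) & (scan_c == 0)
    return flagged            # boolean (grid, grid, grid) mask of voxels to highlight
```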
2947
+ blank |
2948
+ |
2949
+ title | 4.4 Interface Design
2950
+ text | The real-time system guides the user to scan an object and retrieve the closest match.
2951
+ | In our study, we used the Kinect scanner for the acquisition and the retrieval process
2952
+ | took 5-10 seconds/iteration on our unoptimized implementation. The user scans an
2953
+ | object from an operating distance of about 1-3m. The sensor's real-time stream of
+ | depth pointclouds and color images is visible to the user at all times (see
2955
+ | Figure 4.4).
2956
+ | The user starts scanning by pointing the sensor to the ground plane. The ground
2957
+ | plane is detected if the sensor captures a dominant plane that covers more than 50% of
2958
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 76
2959
+ blank |
2960
+ |
2961
+ |
2962
+ text | the scene. Our system uses this plane to extract the upright direction of the captured
2963
+ | scene. When the ground plane is successfully detected, the user receives an indication
2964
+ | on the screen (Figure 4.4, top-right).
2965
+ | In a separate window, the pointcloud data corresponding to the object being cap-
2966
+ | tured is continuously displayed. The system registers the points using image features
2967
+ | and segments the object by extracting the groundplane. The displayed pointcloud
2968
+ | data is also used to calculate the descriptor and the voxel density. At the end of
2969
+ | the retrieval stage (see Section 4.3), the system retains the correspondence between the
+ | closest matching model and the current pointcloud data. The pointcloud is over-
2971
+ | laid with two additional cues: (i) missing data in voxels as compared with the closest
2972
+ | matched model, and (ii) the 3D model of the closest match of the object. Based on
2973
+ | this guidance, the user can then acquire the next scan. The system automatically
2974
+ | stops when the matched model is similar to the captured pointcloud.
2975
+ blank |
2976
+ |
2977
+ title | 4.5 Evaluation
2978
+ text | We tested the robustness of the proposed A2h descriptor on synthetically generated
2979
+ | data against available groundtruth. Further, we let novice users use our system
2980
+ | to scan different indoor environments. The real-time guidance allowed the users to
2981
+ | effectively capture the indoor scenes (see supplementary video).
2982
+ blank |
2983
+ text | dataset # models average # points/scan
2984
+ | chair 2138 45068
2985
+ | couch 1765 129310
2986
+ | lamp 1805 11600
2987
+ | table 5239 61649
2988
+ blank |
2989
+ text | Table 4.1: Database and scan statistics.
2990
+ blank |
2991
+ |
2992
+ |
2993
+ title | 4.5.1 Model Database
2994
+ text | We considered four categories of objects (i.e., chairs, couches, lamps, tables) in our
2995
+ | implementation. For each category, we downloaded a large number of models from
2996
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 77
2997
+ blank |
2998
+ |
2999
+ |
3000
+ text | the Trimble 3D Warehouse (see Table 4.1) to act as proxy geometry in the online
3001
+ | scanning phase. The models were pre-scaled and moved to the origin. We syntheti-
3002
+ | cally scanned each such model from 12 different viewpoints and computed the A2h
3003
+ | descriptor for each such scan. Note that we placed the camera only above the objects
3004
+ | (altitudes of π/6 and π/3) as the input scans rarely capture the underside of the ob-
3005
+ | jects. We used the Kinect scanner to gather streaming data and used an open source
3006
+ | library [EEH+ 11] to accumulate the input data to produce merged scans.
3007
+ blank |
3008
+ |
3009
+ title | 4.5.2 Retrieval Results with Simulated Data
3010
+ text | The proposed A2h descriptor is effective in retrieving similar shapes in fractions of
3011
+ | seconds. Figures 4.5, 4.6, 4.7, and 4.8 show typical retrieval results. In our tests, we
3012
+ | found the retrieval results to be useful for chairs and couches, which have a wider
3013
+ | variation of angles compared to lamps or tables, whose shapes are almost always
+ | highly symmetric.
3015
+ blank |
3016
+ title | Effect of Viewpoints
3017
+ blank |
3018
+ text | The scanned data often have significant parts missing, mainly due to self-occlusion.
3019
+ | We simulated this effect on the A2h descriptor-based retrieval and compared the
3020
+ | performance against retrieval with merged (simulated) scans (Figure 4.9). We found
3021
+ | the retrieval results to be robust and the models sufficiently representative to be used
3022
+ | as proxies for subsequent model assessment.
3023
+ blank |
3024
+ title | Comparison with Other Descriptors
3025
+ blank |
3026
+ text | We also tested existing shape descriptors: the silhouette-based light field descriptor [CTSO03],
+ | the local spin image [Joh97], and the D2 descriptor [OFCD02]. In all the cases, we found
3028
+ | our A2h descriptor to be more effective in quickly resolving local geometric changes,
3029
+ | particularly for low quality partial pointclouds. In contrast, we found the light field
3030
+ | descriptor to be more susceptible to noise, the local spin image more expensive to com-
3031
+ | pute, and the D2 descriptor less able to distinguish between local variations than our
3032
+ | A2h descriptor (see Figure 4.3).
3033
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 78
3034
+ blank |
3035
+ |
3036
+ |
3037
+ text | We next evaluated the degradation in the retrieval results under perturbations in
3038
+ | sampling density and noise.
3039
+ blank |
3040
+ title | Effect of Density
3041
+ blank |
3042
+ text | During scanning, points are sampled uniformly on the sensor grid, instead of uniformly
3043
+ | on the model surface. This uniform sampling on the sensor grid results in varying
3044
+ | densities of scanned points depending on the viewpoint. Our system compensates for
3045
+ | this effect by assigning probabilities that are inversely proportional to the density of
3046
+ | sample points.
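The inverse-density compensation can be sketched as weighted pair sampling, where each point's selection probability is proportional to the inverse of its density estimate (e.g., the surfel density d_f of Section 4.3.1). This is an illustrative sketch assuming numpy; the helper name is hypothetical.

```python
import numpy as np

def density_aware_pairs(points, densities, n_pairs=10_000, seed=0):
    """Sample point-pair indices with probability inversely proportional to density.

    densities: per-point density estimates; sparsely sampled surface regions are
    drawn more often, so the A2h histogram approximates uniform sampling over the
    surface rather than over the sensor grid.
    """
    rng = np.random.default_rng(seed)
    w = 1.0 / np.maximum(np.asarray(densities, dtype=float), 1e-12)
    w = w / w.sum()
    i = rng.choice(len(points), size=n_pairs, p=w)
    j = rng.choice(len(points), size=n_pairs, p=w)
    return i, j
```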
3047
+ | Figure 4.10 shows the effect of density compensation on the histogram distribu-
3048
+ | tions. We tested two different combination of viewpoints and compared the distribu-
3049
+ | tions, using sampling based on uniform distribution or inversely proportional to the
3050
+ | density. Density-aware sampling are indicated by dotted lines. The overall shapes
3051
+ | of the graphs are similar for uniform and density-aware samplings. However, the ab-
3052
+ | solute values on the peaks are observed at similar heights while using density-aware
3053
+ | sampling. Hence, our system uses density-aware sampling to achieve robustness to
3054
+ | sampling variations.
3055
+ blank |
3056
+ title | Effect of Noise
3057
+ blank |
3058
+ text | In Figure 4.11, we show the robustness of A2h histograms under noise. Generally, the
3059
+ | histograms become smoother under increasing noise as subtle orientation variations
3060
+ | get masked. For reference, the Kinect measurements from a distance range of 1-2m
3061
+ | have noise perturbations comparable to 0.005 noise in the simulated data. We added
3062
+ | synthetic Gaussian noise to the simulated data when calculating the A2h descriptors,
+ | so that the shape of their histograms better matches that of real measurements.
3064
+ blank |
3065
+ |
3066
+ title | 4.5.3 Retrieval Results with Real Data
3067
+ text | Figure 4.12 shows retrieval results on a range of objects (i.e., chairs, couches, lamps,
3068
+ | and tables). Overall we found the guided interface to work well in practice. The
3069
+ | performance was better for chairs and couches, while for lamps and tables, the thin
3070
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 79
3071
+ blank |
3072
+ |
3073
+ |
3074
+ text | structures led to some failure cases. In all cases, the system successfully handled
+ | missing data as high as 40-60% of the object surface (i.e., up to half of the object
+ | surface invisible), and the system responded at interactive rates. Note that for
3077
+ | testing purposes we manually pruned the input database models to leave out models
3078
+ | (if any) that looked very similar to the target objects to be scanned. Please refer to
3079
+ | the supplementary video for the system in action.
3080
+ blank |
3081
+ |
3082
+ title | 4.6 Conclusions
3083
+ text | We have presented a real-time guided scanning setup for online quality assessment of
3084
+ | streaming RGBD data obtained while acquiring indoor environments. The proposed
3085
+ | approach is motivated by three key observations: (i) indoor scenes largely consist of
3086
+ | a few different types of objects, each of which can be reasonably approximated by
3087
+ | commonly available 3D model sets; (ii) data is often missed due to self-occlusions,
3088
+ | and hence such missing regions can be predicted by comparisons against synthetically
3089
+ | scanned database models from multiple viewpoints; and (iii) streaming scan data can
3090
+ | be robustly and effectively compared against simulated scans by a direct comparison
3091
+ | of the distribution of relative local orientations in the two types of scans. The best
3092
+ | retrieved model is then used as a proxy to evaluate the quality of the current scan and
3093
+ | guide subsequent acquisition frames. We have demonstrated the real-time system on
3094
+ | a large number of synthetic and real-world examples with a database of 3D models,
3095
+ | often numbering in the few thousands.
3096
+ | In the future, we would like to extend our guided system to create online recon-
3097
+ | structions while specifically focusing on generating semantically valid scene models.
3098
+ | Using context information in the form of co-occurrence cues (e.g., a keyboard and
3099
+ | mouse are usually near each other) can prove to be effective. Finally, we plan to use
3100
+ | GPU-based optimized codes to handle additional categories of 3D models.
3101
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 80
3102
+ blank |
3103
+ |
3104
+ |
3105
+ |
3106
+ text | [Figure 4.3 panel labels (two examples): rows D2, A2h, and query / aligned model.]
3132
+ blank |
3133
+ text | Figure 4.3: Representative shape retrieval results using the D2 descriptor ([OFCD02],
3134
+ | first row), the A2h descriptor introduced in this chapter (Section 4.3.2, second row),
3135
+ | and the aligned models after scan registration (Section 4.3.4, third row) on the top 25
3136
+ | matches from A2h. For each method, we only show the top 4 matches. The D2 and
3137
+ | A2h descriptor (first two rows) are compared by histogram distributions, which is
+ | quick and efficient. Empirically, we observed the A2h descriptor to better capture
3139
+ | local geometric features compared to the D2 descriptor, with local registration further
3140
+ | improving the retrieval quality. The comparison based on 3D alignment (third row)
3141
+ | is more accurate, but requires more computation time, and cannot be performed in
3142
+ | real-time given the size of our database of models.
3143
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 81
3144
+ blank |
3145
+ |
3146
+ |
3147
+ text | [Figure 4.4 panel labels: scanning setup; detected groundplane; scanning guidance; current scan; current scan and retrieved model.]
3167
+ blank |
3168
+ text | Figure 4.4: The proposed guided real-time scanning setup is simple to use. The
3169
+ | user starts by scanning using a Microsoft Kinect (top-left). The system first detects
3170
+ | the ground plane and the user is notified (top-right). The current pointcloud corre-
3171
+ | sponding to the target object is displayed in the 3D view window, the best matching
3172
+ | database model is retrieved (overlaid in transparent white), and the predicted missing
3173
+ | voxels are highlighted as yellow voxels (middle-right). Based on the provided guid-
3174
+ | ance, the user acquires the next frame of data, and the process continues. Our method
3175
+ | stops when the retrieved shape explains the captured pointcloud well. Finally, the
3176
+ | overlaid 3D shape is highlighted in white (bottom-right). Note that the accumulated
3177
+ | scans have significant parts missing in most scanning steps.
3178
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 82
3179
+ blank |
3180
+ |
3181
+ |
3182
+ |
3183
+ text | Figure 4.5: Retrieval results with simulated data using a chair data set. Given the
3184
+ | model in the first column, the database of 2138 models is matched using the A2h
3185
+ | descriptor, and the top 5 matches are shown.
3186
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 83
3187
+ blank |
3188
+ |
3189
+ |
3190
+ |
3191
+ text | Figure 4.6: Retrieval results with simulated data using a couch data set. Given the
3192
+ | model in the first column, the database of 1765 models is matched using the A2h
3193
+ | descriptor, and the top 5 matches are shown.
3194
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 84
3195
+ blank |
3196
+ |
3197
+ |
3198
+ |
3199
+ text | Figure 4.7: Retrieval results with simulated data using a lamp data set. Given the
3200
+ | model in the first column, the database of 1805 models is matched using the A2h
3201
+ | descriptor, and the top 5 matches are shown.
3202
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 85
3203
+ blank |
3204
+ |
3205
+ |
3206
+ |
3207
+ text | Figure 4.8: Retrieval results with simulated data using a table data set. Given the
3208
+ | model in the first column, the database of 5239 models is matched using the A2h
3209
+ | descriptor, and the top 5 matches are shown.
3210
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 86
3211
+ blank |
3212
+ |
3213
+ |
3214
+ |
3215
+ text | [Figure 4.9 panel labels (three examples): query object, merged scan, view-dependent scans.]
3239
+ blank |
3240
+ text | Figure 4.9: Comparison between retrieval with view-dependent and merged scans.
3241
+ | The models are sorted by matching scores, with lower scores denoting better matches.
3242
+ | The leftmost images show the query scans. Note that the view-dependent scan-based
3243
+ | retrieval is robust even with significant missing regions (∼30-50%). The numbers
3244
+ | in parenthesis denote the view index.
3245
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 87
3246
+ blank |
3247
+ |
3248
+ |
3249
+ |
3250
+ text | Figure 4.10: Effect of density-aware sampling on two different combinations of views
+ | (comb1 and comb2). The samplings that consider the density of points are denoted
+ | comb1_d and comb2_d, respectively.
3253
+ blank |
3254
+ |
3255
+ |
3256
+ |
3257
+ text | Figure 4.11: Effect of noise. The shape of the histogram becomes smoother as the level
3258
+ | of noise increases.
3259
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 88
3260
+ blank |
3261
+ |
3262
+ |
3263
+ text | [Figure 4.12 panel labels: image, accumulated scan, retrieved proxy model; rows: chairs, couches, lamps, tables.]
3286
+ blank |
3287
+ |
3288
+ text | Figure 4.12: Real-time retrieval results on various datasets. For each set, we show
3289
+ | the image of the object being scanned, the accumulated pointcloud, and the closest
+ | retrieved model, along with the top 25 candidates that are picked from the
3291
+ | database of thousands of models using the proposed A2h descriptor.
3292
+ meta | Chapter 5
3293
+ blank |
3294
+ title | Conclusions
3295
+ blank |
3296
+ text | 3-D reconstruction in indoor environments is a challenging problem because of the
+ | complexity and variety of the objects present, and the frequent changes in the positions
+ | of objects made by the people who inhabit the space. Building on recent technology, the
+ | work presented in this dissertation frames the reconstruction of indoor environments
+ | as a set of light-weight systems.
3301
+ | RGB-D cameras (e.g., Microsoft Kinect) are a new type of sensor and the standard
3302
+ | for utilizing the data is not yet fully established. Still, the sensor is revolutionary
3303
+ | because it is an affordable technology that can capture the 3-D data of everyday
3304
+ | environments at video frame rate. This dissertation covers quick pipelines that allow
3305
+ | real-time interaction between the user and the system. However, such data
3306
+ | comes at the price of complex noise characteristics.
3307
+ | To reconstruct the challenging indoor structures with limited data, we imposed
3308
+ | different geometric priors depending on the target applications and aimed for high-
3309
+ | level understanding. In Chapter 2, we presented a pipeline to acquire floor plans using
3310
+ | large planes as a geometric prior. We followed the well-known Manhattan-world
3311
+ | assumption and utilized user feedback to overcome ambiguous situations and specify
3312
+ | the important planes to be included in the model. Chapter 3 described our use
3313
+ | of simple models of repeating objects with deformation modes. Public places with
3314
+ | many repeating objects can be reconstructed by recovering the low-dimensional
3315
+ | deformation and placement information. Chapter 4 showed how we retrieve complex
3316
+ blank |
3317
+ |
3318
+ meta | 89
3319
+ | CHAPTER 5. CONCLUSIONS 90
3320
+ blank |
3321
+ |
3322
+ |
3323
+ text | shapes of objects with the help of a large database of 3-D models, as we developed a
+ | descriptor that can be computed and searched efficiently and allows online quality
+ | assessment to be presented to the user.
3326
+ | Each of the pipelines presented in these chapters targets a specific application
3327
+ | and has been evaluated accordingly. The work of the dissertation can be extended
3328
+ | into other possible real-life applications that can connect actual environments with
3329
+ | the virtual world. The depth data from RGB-D cameras is easy to acquire, but we
3330
+ | still do not know how to make full use of the massive amount of information produced.
3331
+ | The potential applications can benefit from better understanding and handling of the
3332
+ | data. As one extension, we are interested in scaling the database of models and data
3333
+ | with special attention paid to the data structures. The research community and others
+ | would also benefit from advances in the use of reliable depth and color features for
+ | the new type of data obtained from RGB-D sensors, in addition to the presented
+ | descriptor.
3337
+ meta | Bibliography
3338
+ blank |
3339
+ ref | [BAD10] Soonmin Bae, Aseem Agarwala, and Fredo Durand. Computational
3340
+ | rephotography. ACM Trans. Graph., 29(5), 2010.
3341
+ blank |
3342
+ ref | [BM92] Paul J. Besl and Neil D. McKay. A method for registration of 3-D
3343
+ | shapes. IEEE PAMI, 14(2):239–256, 1992.
3344
+ blank |
3345
+ ref | [CTSO03] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On
3346
+ | visual similarity based 3D model retrieval. CGF, 22(3):223–232, 2003.
3347
+ blank |
3348
+ ref | [CY99] James M. Coughlan and A. L. Yuille. Manhattan world: Compass
3349
+ | direction from a single image by bayesian inference. In ICCV, pages
3350
+ | 941–947, 1999.
3351
+ blank |
3352
+ ref | [CZ11] Will Chang and Matthias Zwicker. Global registration of dynamic range
3353
+ | scans for articulated model reconstruction. ACM TOG, 30(3):26:1–
3354
+ | 26:15, 2011.
3355
+ blank |
3356
+ ref | [Dey07] T. K. Dey. Curve and Surface Reconstruction : Algorithms with Math-
3357
+ | ematical Analysis. Cambridge University Press, 2007.
3358
+ blank |
3359
+ ref | [DHR+ 11] Hao Du, Peter Henry, Xiaofeng Ren, Marvin Cheng, Dan B. Goldman,
3360
+ | Steven M. Seitz, and Dieter Fox. Interactive 3d modeling of indoor
3361
+ | environments with a consumer depth camera. In Proc. Ubiquitous com-
3362
+ | puting, pages 75–84, 2011.
3363
+ blank |
3364
+ ref | [EEH+ 11] Nikolas Engelhard, Felix Endres, Jürgen Hess, Jürgen Sturm, and Wol-
3365
+ | fram Burgard. Real-time 3D visual SLAM with a hand-held RGB-D
3366
+ blank |
3367
+ meta | 91
3368
+ | BIBLIOGRAPHY 92
3369
+ blank |
3370
+ |
3371
+ |
3372
+ ref | camera. In Proc. of the RGB-D Workshop on 3D Perception in Robotics
3373
+ | at the European Robotics Forum, 2011.
3374
+ blank |
3375
+ ref | [FB81] Martin A. Fischler and Robert C. Bolles. Random sample consensus:
3376
+ | a paradigm for model fitting with applications to image analysis and
3377
+ | automated cartography. Commun. ACM, 24(6):381–395, June 1981.
3378
+ blank |
3379
+ ref | [FCSS09] Y. Furukawa, B. Curless, S.M. Seitz, and R. Szeliski. Reconstructing
3380
+ | building interiors from images. In ICCV, pages 80–87, 2009.
3381
+ blank |
3382
+ ref | [FSH11] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing
3383
+ | structural relationships in scenes using graph kernels. ACM TOG,
3384
+ | 30(4):34:1–34:11, 2011.
3385
+ blank |
3386
+ ref | [GCCMC08] Andrew P. Gee, Denis Chekhlov, Andrew Calway, and Walterio Mayol-
3387
+ | Cuevas. Discovering higher level structure in visual slam. IEEE Trans-
3388
+ | actions on Robotics, 24(5):980–990, October 2008.
3389
+ blank |
3390
+ ref | [GEH10] Abhinav Gupta, Alexei A. Efros, and Martial Hebert. Blocks world re-
3391
+ | visited: Image understanding using qualitative geometry and mechan-
3392
+ | ics. In ECCV, pages 482–496, 2010.
3393
+ blank |
3394
+ ref | [HCI+ 11] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab,
3395
+ | and V. Lepetit. Multimodal templates for real-time detection of texture-
3396
+ | less objects in heavily cluttered scenes. ICCV, 2011.
3397
+ blank |
3398
+ ref | [HKG11] Qixing Huang, Vladlen Koltun, and Leonidas Guibas. Joint-shape seg-
3399
+ | mentation with linear programming. ACM TOG (SIGGRAPH Asia),
3400
+ | 30(6):125:1–125:11, 2011.
3401
+ blank |
3402
+ ref | [HKH+ 12] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter
3403
+ | Fox. RGBD mapping: Using kinect-style depth cameras for dense 3D
3404
+ | modeling of indoor environments. I. J. Robotic Res., 31(5):647–663,
3405
+ | 2012.
3406
+ meta | BIBLIOGRAPHY 93
3407
+ blank |
3408
+ |
3409
+ |
3410
+ ref | [IKH+ 11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard
3411
+ | Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Free-
3412
+ | man, Andrew Davison, and Andrew Fitzgibbon. Kinectfusion: real-time
3413
+ | 3D reconstruction and interaction using a moving depth camera. In
3414
+ | Proc. UIST, pages 559–568, 2011.
3415
+ blank |
3416
+ ref | [Joh97] Andrew Johnson. Spin-Images: A Representation for 3-D Surface
3417
+ | Matching. PhD thesis, Robotics Institute, CMU, 1997.
3418
+ blank |
3419
+ ref | [JTRS12] Arjun Jain, Thorsten Thormahlen, Tobias Ritschel, and Hans-Peter Sei-
3420
+ | del. Exploring shape variations by 3d-model decomposition and part-
3421
+ | based recombination. CGF (EUROGRAPHICS), 31(2):631–640, 2012.
3422
+ blank |
3423
+ ref | [KAJS11] H.S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic la-
3424
+ | beling of 3D point clouds for indoor scenes. In NIPS, pages 244–252,
3425
+ | 2011.
3426
+ blank |
3427
+ ref | [KDS+ 12] Young Min Kim, Jennifer Dolson, Michael Sokolsky, Vladlen Koltun,
3428
+ | and Sebastian Thrun. Interactive acquisition of residential floor plans.
3429
+ | In ICRA, pages 3055–3062, 2012.
3430
+ blank |
3431
+ ref | [KMYG12] Young Min Kim, Niloy J. Mitra, Dong-Ming Yan, and Leonidas Guibas.
3432
+ | Acquiring 3d indoor environments with variability and repetition. ACM
3433
+ | TOG, 31(6), 2012.
3434
+ blank |
3435
+ ref | [LAGP09] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust
3436
+ | single-view geometry and motion reconstruction. ACM TOG (SIG-
3437
+ | GRAPH), 28(5):175:1–175:10, 2009.
3438
+ blank |
3439
+ ref | [LGHK10] David Changsoo Lee, Abhinav Gupta, Martial Hebert, and Takeo
3440
+ | Kanade. Estimating spatial layout of rooms using volumetric reasoning
3441
+ | about objects and surfaces. In NIPS, pages 1288–1296, 2010.
3442
+ blank |
3443
+ ref | [LH05] Marius Leordeanu and Martial Hebert. A spectral technique for cor-
3444
+ | respondence problems using pairwise constraints. In ICCV, volume 2,
3445
+ | pages 1482–1489, 2005.
3446
+ meta | BIBLIOGRAPHY 94
3447
+ blank |
3448
+ |
3449
+ |
3450
+ ref | [MFO+ 07] Niloy J. Mitra, Simon Flory, Maks Ovsjanikov, Natasha Gelfand,
3451
+ | Leonidas Guibas, and Helmut Pottmann. Dynamic geometry registra-
3452
+ | tion. In Symp. on Geometry Proc., pages 173–182, 2007.
3453
+ blank |
3454
+ ref | [Mic10] Microsoft. Kinect for Xbox 360. http://www.xbox.com/en-US/kinect,
3455
+ | November 2010.
3456
+ blank |
3457
+ ref | [MM09] Pranav Mistry and Pattie Maes. Sixthsense: a wearable gestural in-
3458
+ | terface. In SIGGRAPH ASIA Art Gallery & Emerging Technologies,
3459
+ | page 85, 2009.
3460
+ blank |
3461
+ ref | [MPWC12] Niloy J. Mitra, Mark Pauly, Michael Wand, and Duygu Ceylan. Symme-
3462
+ | try in 3d geometry: Extraction and applications. In EUROGRAPHICS
3463
+ | State-of-the-art Report, 2012.
3464
+ blank |
3465
+ ref | [MYY+ 10] N. Mitra, Y.-L. Yang, D.-M. Yan, W. Li, and M. Agrawala. Illus-
3466
+ | trating how mechanical assemblies work. ACM TOG (SIGGRAPH),
3467
+ | 29(4):58:1–58:12, 2010.
3468
+ blank |
3469
+ ref | [MZL+ 09] Ravish Mehra, Qingnan Zhou, Jeremy Long, Alla Sheffer, Amy Gooch,
3470
+ | and Niloy J. Mitra. Abstraction of man-made shapes. ACM TOG
3471
+ | (SIGGRAPH Asia), 28(5):#137, 1–10, 2009.
3472
+ blank |
3473
+ ref | [ND10] Richard A. Newcombe and Andrew J. Davison. Live dense reconstruc-
3474
+ | tion with a single moving camera. In CVPR, 2010.
3475
+ blank |
3476
+ ref | [NXS12] Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach
3477
+ | for cluttered indoor scene understanding. ACM TOG (SIGGRAPH
3478
+ | Asia), 31(6), 2012.
3479
+ blank |
3480
+ ref | [OFCD02] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David
3481
+ | Dobkin. Shape distributions. ACM Transactions on Graphics,
3482
+ | 21(4):807–832, October 2002.
3483
+ meta | BIBLIOGRAPHY 95
3484
+ blank |
3485
+ |
3486
+ |
3487
+ ref | [OLGM11] Maks Ovsjanikov, Wilmot Li, Leonidas Guibas, and Niloy J. Mitra.
3488
+ | Exploration of continuous variability in collections of 3D shapes. ACM
3489
+ | TOG (SIGGRAPH), 30(4):33:1–33:10, 2011.
3490
+ blank |
3491
+ ref | [PMG+ 05] Mark Pauly, Niloy J. Mitra, Joachim Giesen, Markus Gross, and
3492
+ | Leonidas J. Guibas. Example-based 3D scan completion. In Symp.
3493
+ | on Geometry Proc., pages 23–32, 2005.
3494
+ blank |
3495
+ ref | [PMW+ 08] M. Pauly, N. J. Mitra, J. Wallner, H. Pottmann, and L. Guibas. Discov-
3496
+ | ering structural regularity in 3D geometry. ACM TOG (SIGGRAPH),
3497
+ | 27(3):43:1–43:11, 2008.
3498
+ blank |
3499
+ ref | [RBF12] Xiaofeng Ren, Liefeng Bo, and D. Fox. RGB-D scene labeling: Features
3500
+ | and algorithms. In CVPR, pages 2759 – 2766, 2012.
3501
+ blank |
3502
+ ref | [RHHL02] Szymon Rusinkiewicz, Olaf Hall-Holt, and Marc Levoy. Real-time 3D
3503
+ | model acquisition. ACM TOG (SIGGRAPH), 21(3):438–446, 2002.
3504
+ blank |
3505
+ ref | [RL01] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP
3506
+ | algorithm. In Proc. 3DIM, 2001.
3507
+ blank |
3508
+ ref | [RTG98] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for
3509
+ | distributions with applications to image databases. In ICCV, pages
3510
+ | 59–, 1998.
3511
+ blank |
3512
+ ref | [SFC+ 11] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark
3513
+ | Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-
3514
+ | time human pose recognition in parts from a single depth image. In
3515
+ | CVPR, pages 1297–1304, 2011.
3516
+ blank |
3517
+ ref | [SvKK+ 11] Oana Sidi, Oliver van Kaick, Yanir Kleiman, Hao Zhang, and Daniel
3518
+ | Cohen-Or. Unsupervised co-segmentation of a set of shapes via
3519
+ | descriptor-space spectral clustering. ACM TOG (SIGGRAPH Asia),
3520
+ | 30(6):126:1–126:10, 2011.
3521
+ meta | BIBLIOGRAPHY 96
3522
+ blank |
3523
+ |
3524
+ |
3525
+ ref | [SWK07] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient RANSAC
3526
+ | for point-cloud shape detection. CGF (EUROGRAPHICS), 26(2):214–
3527
+ | 226, 2007.
3528
+ blank |
3529
+ ref | [SWWK08] Ruwen Schnabel, Raoul Wessel, Roland Wahl, and Reinhard Klein.
3530
+ | Shape recognition in 3D point-clouds. In Proc. WSCG, pages 65–72,
3531
+ | 2008.
3532
+ blank |
3533
+ ref | [SXZ+ 12] Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and
3534
+ | Baining Guo. An interactive approach to semantic modeling of indoor
3535
+ | scenes with an RGBD camera. ACM TOG (SIGGRAPH Asia), 31(6),
3536
+ | 2012.
3537
+ blank |
3538
+ ref | [Thr02] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel,
3539
+ | editors, Exploring Artificial Intelligence in the New Millenium. Morgan
3540
+ | Kaufmann, 2002.
3541
+ blank |
3542
+ ref | [TMHF00] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W.
3543
+ | Fitzgibbon. Bundle adjustment - a modern synthesis. In Proceedings of
3544
+ | the International Workshop on Vision Algorithms: Theory and Practice,
3545
+ | ICCV ’99. Springer-Verlag, 2000.
3546
+ blank |
3547
+ ref | [TSS10] R. Triebel, J. Shin, and R. Siegwart. Segmentation and unsupervised
3548
+ | part-based discovery of repetitive objects. In Proceedings of Robotics:
3549
+ | Science and Systems, 2010.
3550
+ blank |
3551
+ ref | [TW05] Sebastian Thrun and Ben Wegbreit. Shape from symmetry. In ICCV,
3552
+ | pages 1824–1831, 2005.
3553
+ blank |
3554
+ ref | [VAB10] Carlos A. Vanegas, Daniel G. Aliaga, and Bedrich Benes. Building
3555
+ | reconstruction using manhattan-world grammars. In CVPR, pages 358–
3556
+ | 365, 2010.
3557
+ blank |
3558
+ ref | [Vil03] C. Villani. Topics in Optimal Transportation. Graduate Studies in
3559
+ | Mathematics. American Mathematical Society, 2003.
3560
+ meta | BIBLIOGRAPHY 97
3561
+ blank |
3562
+ |
3563
+ |
3564
+ ref | [XLZ+ 10] Kai Xu, Honghua Li, Hao Zhang, Daniel Cohen-Or, Yueshan Xiong,
3565
+ | and Zhiquan Cheng. Style-content separation by anisotropic part scales.
3566
+ | ACM TOG (SIGGRAPH Asia), 29(5):184:1–184:10, 2010.
3567
+ blank |
3568
+ ref | [XS12] Yu Xiang and Silvio Savarese. Estimating the aspect layout of object
3569
+ | categories. In CVPR, pages 3410–3417, 2012.
3570
+ blank |
3571
+ ref | [XZZ+ 11] Kai Xu, Hanlin Zheng, Hao Zhang, Daniel Cohen-Or, Ligang Liu, and
3572
+ | Yueshan Xiong. Photo-inspired model-driven 3D object modeling. ACM
3573
+ | TOG (SIGGRAPH), 30(4):80:1–80:10, 2011.
3574
+ blank |
3575
+ ref | [ZCC+ 12] Youyi Zheng, Xiang Chen, Ming-Ming Cheng, Kun Zhou, Shi-Min Hu,
3576
+ | and Niloy J. Mitra. Interactive images: Cuboid proxies for smart image
3577
+ | manipulation. ACM TOG (SIGGRAPH), 31(4):99:1–99:11, 2012.
3578
+ blank |