anystyle 1.0.0
- checksums.yaml +7 -0
- data/HISTORY.md +78 -0
- data/LICENSE +27 -0
- data/README.md +103 -0
- data/lib/anystyle.rb +71 -0
- data/lib/anystyle/dictionary.rb +132 -0
- data/lib/anystyle/dictionary/gdbm.rb +52 -0
- data/lib/anystyle/dictionary/lmdb.rb +67 -0
- data/lib/anystyle/dictionary/marshal.rb +27 -0
- data/lib/anystyle/dictionary/redis.rb +55 -0
- data/lib/anystyle/document.rb +264 -0
- data/lib/anystyle/errors.rb +14 -0
- data/lib/anystyle/feature.rb +27 -0
- data/lib/anystyle/feature/affix.rb +43 -0
- data/lib/anystyle/feature/brackets.rb +32 -0
- data/lib/anystyle/feature/canonical.rb +13 -0
- data/lib/anystyle/feature/caps.rb +20 -0
- data/lib/anystyle/feature/category.rb +70 -0
- data/lib/anystyle/feature/dictionary.rb +16 -0
- data/lib/anystyle/feature/indent.rb +16 -0
- data/lib/anystyle/feature/keyword.rb +52 -0
- data/lib/anystyle/feature/line.rb +39 -0
- data/lib/anystyle/feature/locator.rb +18 -0
- data/lib/anystyle/feature/number.rb +39 -0
- data/lib/anystyle/feature/position.rb +28 -0
- data/lib/anystyle/feature/punctuation.rb +22 -0
- data/lib/anystyle/feature/quotes.rb +20 -0
- data/lib/anystyle/feature/ref.rb +21 -0
- data/lib/anystyle/feature/terminal.rb +19 -0
- data/lib/anystyle/feature/words.rb +74 -0
- data/lib/anystyle/finder.rb +94 -0
- data/lib/anystyle/format/bibtex.rb +63 -0
- data/lib/anystyle/format/csl.rb +28 -0
- data/lib/anystyle/normalizer.rb +65 -0
- data/lib/anystyle/normalizer/brackets.rb +13 -0
- data/lib/anystyle/normalizer/container.rb +13 -0
- data/lib/anystyle/normalizer/date.rb +109 -0
- data/lib/anystyle/normalizer/edition.rb +16 -0
- data/lib/anystyle/normalizer/journal.rb +14 -0
- data/lib/anystyle/normalizer/locale.rb +30 -0
- data/lib/anystyle/normalizer/location.rb +24 -0
- data/lib/anystyle/normalizer/locator.rb +22 -0
- data/lib/anystyle/normalizer/names.rb +88 -0
- data/lib/anystyle/normalizer/page.rb +29 -0
- data/lib/anystyle/normalizer/publisher.rb +18 -0
- data/lib/anystyle/normalizer/pubmed.rb +18 -0
- data/lib/anystyle/normalizer/punctuation.rb +23 -0
- data/lib/anystyle/normalizer/quotes.rb +14 -0
- data/lib/anystyle/normalizer/type.rb +54 -0
- data/lib/anystyle/normalizer/volume.rb +26 -0
- data/lib/anystyle/parser.rb +199 -0
- data/lib/anystyle/support.rb +4 -0
- data/lib/anystyle/support/finder.mod +3234 -0
- data/lib/anystyle/support/finder.txt +75 -0
- data/lib/anystyle/support/parser.mod +15025 -0
- data/lib/anystyle/support/parser.txt +75 -0
- data/lib/anystyle/utils.rb +70 -0
- data/lib/anystyle/version.rb +3 -0
- data/res/finder/bb132pr2055.ttx +6803 -0
- data/res/finder/bb550sh8053.ttx +18660 -0
- data/res/finder/bb599nz4341.ttx +2957 -0
- data/res/finder/bb725rt6501.ttx +15276 -0
- data/res/finder/bc605xz1554.ttx +18815 -0
- data/res/finder/bd040gx5718.ttx +4271 -0
- data/res/finder/bd413nt2715.ttx +4956 -0
- data/res/finder/bd466fq0394.ttx +6100 -0
- data/res/finder/bf668vw2021.ttx +3578 -0
- data/res/finder/bg495cx0468.ttx +7267 -0
- data/res/finder/bg599vt3743.ttx +6752 -0
- data/res/finder/bg608dx2253.ttx +4094 -0
- data/res/finder/bh410qk3771.ttx +8785 -0
- data/res/finder/bh989ww6442.ttx +17204 -0
- data/res/finder/bj581pc8202.ttx +2719 -0
- data/res/parser/bad.xml +5199 -0
- data/res/parser/core.xml +7924 -0
- data/res/parser/gold.xml +2707 -0
- data/res/parser/good.xml +34281 -0
- data/res/parser/stanford-books.xml +2280 -0
- data/res/parser/stanford-diss.xml +726 -0
- data/res/parser/stanford-theses.xml +4684 -0
- data/res/parser/ugly.xml +33246 -0
- metadata +195 -0
+++ b/data/res/finder/bf668vw2021.ttx
@@ -0,0 +1,3578 @@
+title | A LIGHT-WEIGHT 3-D INDOOR ACQUISITION SYSTEM
+      | USING AN RGB-D CAMERA
+blank |
+      |
+      |
+      |
+title | A DISSERTATION
+      | SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
+      | ENGINEERING
+      | AND THE COMMITTEE ON GRADUATE STUDIES
+      | OF STANFORD UNIVERSITY
+      | IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
+      | FOR THE DEGREE OF
+      | DOCTOR OF PHILOSOPHY
+blank |
+      |
+      |
+      |
+text  | Young Min Kim
+      | August 2013
+      | © 2013 by Young Min Kim. All Rights Reserved.
+      | Re-distributed by Stanford University under license with the author.
+blank |
+      |
+      |
+text  | This work is licensed under a Creative Commons Attribution-
+      | Noncommercial 3.0 United States License.
+      | http://creativecommons.org/licenses/by-nc/3.0/us/
+blank |
+      |
+      |
+      |
+text  | This dissertation is online at: http://purl.stanford.edu/bf668vw2021
+blank |
+text  | Includes supplemental files:
+      | 1. Video for Chapter 4 (video_final_medium3.wmv)
+      | 2. Video for Chapter 2 (Reconstruct.mpg)
+blank |
+      |
+      |
+      |
+meta  | ii
+text  | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
+      | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
+blank |
+text  | Leonidas Guibas, Primary Adviser
+blank |
+      |
+      |
+text  | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
+      | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
+blank |
+text  | Bernd Girod
+blank |
+      |
+      |
+text  | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
+      | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
+blank |
+text  | Sebastian Thrun
+blank |
+      |
+      |
+      |
+text  | Approved for the Stanford University Committee on Graduate Studies.
+      | Patricia J. Gumport, Vice Provost for Graduate Education
+blank |
+      |
+      |
+      |
+text  | This signature page was generated electronically upon submission of this dissertation in
+      | electronic format. An original signed hard copy of the signature page is on file in
+      | University Archives.
+blank |
+      |
+      |
+      |
+meta  | iii
+title | Abstract
+blank |
+text  | Large-scale acquisition of exterior urban environments is by now a well-established
+      | technology, supporting many applications in map searching, navigation, and com-
+      | merce. The same is, however, not the case for indoor environments, where access is
+      | often restricted and the spaces can be cluttered. Recent advances in real-time 3D
+      | acquisition devices (e.g., Microsoft Kinect) enable everyday users to scan complex
+      | indoor environments at a video rate. Raw scans, however, are often noisy, incom-
+      | plete, and significantly corrupted, making semantic scene understanding difficult, if
+      | not impossible. In this dissertation, we present ways of utilizing prior information
+      | to semantically understand the environments from the noisy scans of real-time 3-D
+      | sensors. The presented pipelines are lightweight, and have the potential to allow
+      | users to provide feedback at interactive rates.
+      | We first present a hand-held system for real-time, interactive acquisition of res-
+      | idential floor plans. The system integrates a commodity range camera, a micro-
+      | projector, and a button interface for user input and allows the user to freely move
+      | through a building to capture its important architectural elements. The system uses
+      | the Manhattan world assumption, which posits that wall layouts are rectilinear. This
+      | assumption allows generation of floor plans in real time, enabling the operator to
+      | interactively guide the reconstruction process and to resolve structural ambiguities
+      | and errors during the acquisition. The interactive component aids users with no ar-
+      | chitectural training in acquiring wall layouts for their residences. We show a number
+      | of residential floor plans reconstructed with the system.
+      | We then discuss how we exploit the fact that public environments typically contain
+      | a high density of repeated objects (e.g., tables, chairs, monitors, etc.) in regular or
+blank |
+      |
+meta  | iv
+text  | non-regular arrangements with significant pose variations and articulations. We use
+      | the special structure of indoor environments to accelerate their 3D acquisition and
+      | recognition. Our approach consists of two phases: (i) a learning phase wherein we
+      | acquire 3D models of frequently occurring objects and capture their variability modes
+      | from only a few scans, and (ii) a recognition phase wherein from a single scan of a
+      | new area, we identify previously seen objects but in different poses and locations at
+      | an average recognition time of 200ms/model. We evaluate the robustness and limits
+      | of the proposed recognition system using a range of synthetic and real-world scans
+      | under challenging settings.
+      | Last, we present a guided real-time scanning setup, wherein the incoming 3D
+      | data stream is continuously analyzed, and the data quality is automatically assessed.
+      | While the user is scanning an object, the proposed system discovers and highlights
+      | the missing parts, thus guiding the operator (or the autonomous robot) to “where
+      | to scan next”. We assess the data quality and completeness of the 3D scan data
+      | by comparing to a large collection of commonly occurring indoor man-made objects
+      | using an efficient, robust, and effective scan descriptor. We have tested the system
+      | on a large number of simulated and real setups, and found the guided interface to be
+      | effective even in cluttered and complex indoor environments. Overall, the research
+      | presented in the dissertation discusses how low-quality 3-D scans can be effectively
+      | used to understand indoor environments and allow necessary user-interaction in real-
+      | time. The presented pipelines are designed to be quick and effective by utilizing
+      | different geometric priors depending on the target applications.
+blank |
+      |
+      |
+      |
+meta  | v
+title | Acknowledgements
+blank |
+text  | All the work presented in this thesis would not have been possible without help from
+      | many people.
+      | First of all, I would like to express my sincerest gratitude to my advisor, Leonidas
+      | Guibas. He is not only an intelligent and inspiring scholar in amazingly diverse
+      | topics, but also a very caring advisor with deep insights into various aspects of life.
+      | He guided me through one of the toughest times of my life, and I am lucky to be one
+      | of his students.
+      | During my life at Stanford, I had the privilege of working with the smartest people
+      | in the world, learning not only about research, but also about the different mind-sets
+      | that lead to successful careers. I would like to thank Bernd Girod, Christian Theobalt,
+      | Sebastian Thrun, Vladlen Koltun, Niloy Mitra, Saumitra Das, Stephen Gould, and
+      | Adrian Butscher for being mentors during different stages of my graduate career. I
+      | also appreciate the help of wonderful collaborators on exciting projects: Jana Kosecka,
+      | Branislav Miscusik, James Diebel, Mike Sokolsky, Jen Dolson, Dongming Yan, and
+      | Qixing Huang.
+      | The work presented here was generously supported by the following funding
+      | sources: Samsung Scholarship, MPC-VCC, Qualcomm corporation.
+      | I adore my officemates for being cheerful and encouraging, and most of all, being
+      | there: Derek Chan, Rahul Biswas, Stephanie Lefevre, Qixing Huang, Jonathan Jiang,
+      | Art Tevs, Michael Kerber, Justin Solomon, Jonathan Huang, Fan Wang, Daniel Chen,
+      | Kyle Heath, Vangelis Kalogerakis, and Sharath Kumar Raghvendra. I often spent
+      | more time with them than with any other people.
+      | I have to thank all the friends I met at Stanford. In particular, I would like to
+blank |
+      |
+meta  | vi
+text  | thank Stephanie Kwan, Karen Zhu, Landry Huet, and Yiting Yeh for fun hangouts
+      | and random conversations in my early years. I was also fortunate enough to meet a
+      | wonderful chamber music group led by Dr. Herbert Myers in which I could play early
+      | music with Michael Peterson and Lisa Silverman. I also appreciate being able to
+      | participate in a wonderful WISE (Women in Science and Engineering) group. WISE
+      | girls have always been smart, tender and supportive. Many Korean friends at Stanford
+      | were like family for me here. I will not attempt to name them all, but I would like to
+      | especially thank Jeongha Park, Soogine Chong, Sun-Hae Hong, Jenny Lee, Ga-Young
+      | Suh, Joyce Lee, Hyeji Kim, Sun Goo Lee, Wookyung Kim, Han Ho Song and Su-In
+      | Lee. While I was enjoying my life at Stanford, I was always connected to my friends
+      | in Korea. I would like to express my thanks for their trust and everlasting friendship.
+      | Last, I cannot thank my family enough. I would like to dedicate my thesis to my
+      | parents, Kwang Woo Kim and Mi Ja Lee. Their constant love and trust have helped
+      | me overcome hardships ever since I was born. I also enjoyed having my brother, Joo
+      | Hwan Kim, in the Bay Area. His passion and thoughtful advice always helped me
+      | and cheered me up. I thank my husband, Sung-Boem Park, for being by my side no
+      | matter what happened. He is my best friend, and he made me face and overcome
+      | challenges. I also need to thank my soon-to-be-born son (due in August), for allowing
+      | me to accelerate the last stages of my Ph.D.
+      | Thank you all for making me who I am today.
+blank |
+      |
+      |
+      |
+meta  | vii
+title | Contents
+blank |
+text  | Abstract iv
+blank |
+text  | Acknowledgements vi
+blank |
+text  | 1 Introduction 1
+      | 1.1 Background on RGB-D Cameras . . . . . . . . . . . . . . . . . . . . 3
+      | 1.1.1 Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
+      | 1.1.2 Noise Characteristics . . . . . . . . . . . . . . . . . . . . . . . 5
+      | 1.2 3-D Indoor Acquisition System . . . . . . . . . . . . . . . . . . . . . 6
+      | 1.3 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . 7
+      | 1.3.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
+blank |
+text  | 2 Interactive Acquisition of Residential Floor Plans1 11
+      | 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
+      | 2.2 System Overview and Usage . . . . . . . . . . . . . . . . . . . . . . . 14
+      | 2.3 Data Acquisition Process . . . . . . . . . . . . . . . . . . . . . . . . . 16
+      | 2.3.1 Pair-Wise Registration . . . . . . . . . . . . . . . . . . . . . . 19
+      | 2.3.2 Plane Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 22
+      | 2.3.3 Global Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 23
+      | 2.3.4 Map Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
+      | 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
+      | 2.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 29
+blank |
+      |
+      |
+      |
+meta  | viii
+text  | 3 Environments with Variability and Repetition 33
+      | 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
+      | 3.1.1 Scanning Technology . . . . . . . . . . . . . . . . . . . . . . . 35
+      | 3.1.2 Geometric Priors for Objects . . . . . . . . . . . . . . . . . . . 35
+      | 3.1.3 Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . 36
+      | 3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
+      | 3.2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
+      | 3.2.2 Hierarchical Structure . . . . . . . . . . . . . . . . . . . . . . 40
+      | 3.3 Learning Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
+      | 3.3.1 Initializing the Skeleton of the Model . . . . . . . . . . . . . . 43
+      | 3.3.2 Incrementally Completing a Coherent Model . . . . . . . . . . 45
+      | 3.4 Recognition Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
+      | 3.4.1 Initial Assignment for Parts . . . . . . . . . . . . . . . . . . . 47
+      | 3.4.2 Refined Assignment with Geometry . . . . . . . . . . . . . . . 49
+      | 3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
+      | 3.5.1 Synthetic Scenes . . . . . . . . . . . . . . . . . . . . . . . . . 51
+      | 3.5.2 Real-World Scenes . . . . . . . . . . . . . . . . . . . . . . . . 54
+      | 3.5.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
+      | 3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
+      | 3.5.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
+      | 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
+blank |
+text  | 4 Guided Real-Time Scanning 64
+      | 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
+      | 4.1.1 Interactive Acquisition . . . . . . . . . . . . . . . . . . . . . . 67
+      | 4.1.2 Scan Completion . . . . . . . . . . . . . . . . . . . . . . . . . 67
+      | 4.1.3 Part-Based Modeling . . . . . . . . . . . . . . . . . . . . . . . 67
+      | 4.1.4 Template-Based Completion . . . . . . . . . . . . . . . . . . . 68
+      | 4.1.5 Shape Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 68
+      | 4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
+      | 4.2.1 Scan Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 70
+blank |
+      |
+meta  | ix
+text  | 4.2.2 Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 70
+      | 4.2.3 Scan Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
+      | 4.3 Partial Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 71
+      | 4.3.1 View-Dependent Simulated Scans . . . . . . . . . . . . . . . . 72
+      | 4.3.2 A2h Scan Descriptor . . . . . . . . . . . . . . . . . . . . . . . 73
+      | 4.3.3 Descriptor-Based Shape Matching . . . . . . . . . . . . . . . . 74
+      | 4.3.4 Scan Registration . . . . . . . . . . . . . . . . . . . . . . . . . 75
+      | 4.4 Interface Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
+      | 4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
+      | 4.5.1 Model Database . . . . . . . . . . . . . . . . . . . . . . . . . . 76
+      | 4.5.2 Retrieval Results with Simulated Data . . . . . . . . . . . . . 77
+      | 4.5.3 Retrieval Results with Real Data . . . . . . . . . . . . . . . . 78
+      | 4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
+blank |
+text  | 5 Conclusions 89
+blank |
+text  | Bibliography 91
+blank |
+      |
+      |
+      |
+meta  | x
+title | List of Tables
+blank |
+text  | 2.1 Accuracy comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 29
+blank |
+text  | 3.1 Parameters used in our algorithm . . . . . . . . . . . . . . . . . . . . 41
+      | 3.2 Models obtained from the learning phase . . . . . . . . . . . . . . . . 55
+      | 3.3 Statistics for the recognition phase . . . . . . . . . . . . . . . . . . . 56
+      | 3.4 Statistics between objects learned for each scene category . . . . . . . 59
+blank |
+text  | 4.1 Database and scan statistics . . . . . . . . . . . . . . . . . . . . . . . 76
+blank |
+      |
+      |
+      |
+meta  | xi
+title | List of Figures
+blank |
+text  | 1.1 Triangulation principle . . . . . . . . . . . . . . . . . . . . . . . . . . 4
+      | 1.2 Kinect sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
+blank |
+text  | 2.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
+      | 2.2 System pipeline and usage . . . . . . . . . . . . . . . . . . . . . . . . 15
+      | 2.3 Notation and representation . . . . . . . . . . . . . . . . . . . . . . . 17
+      | 2.4 Illustration for pair-wise registration . . . . . . . . . . . . . . . . . . 19
+      | 2.5 Optical flow and image plane correspondence . . . . . . . . . . . . . . 20
+      | 2.6 Silhouette points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
+      | 2.7 Optimizing the map . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
+      | 2.8 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
+      | 2.9 Analysis on computational time . . . . . . . . . . . . . . . . . . . . . 27
+      | 2.10 Visual comparisons of the generated floor plans . . . . . . . . . . . . 31
+      | 2.11 A possible example of extensions . . . . . . . . . . . . . . . . . . . . 32
+blank |
+text  | 3.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
+      | 3.2 Acquisition pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
+      | 3.3 Hierarchical data structure . . . . . . . . . . . . . . . . . . . . . . . . 39
+      | 3.4 Overview of the learning phase . . . . . . . . . . . . . . . . . . . . . 42
+      | 3.5 Attachment of the model . . . . . . . . . . . . . . . . . . . . . . . . . 46
+      | 3.6 Overview of the recognition phase . . . . . . . . . . . . . . . . . . . . 47
+      | 3.7 Refining the segmentation . . . . . . . . . . . . . . . . . . . . . . . . 50
+      | 3.8 Recognition results on synthetic scans of virtual scenes . . . . . . . . 52
+      | 3.9 Chair models used in synthetic scenes . . . . . . . . . . . . . . . . . . 53
+blank |
+      |
+meta  | xii
+text  | 3.10 Precision-recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
+      | 3.11 Various models learned/used in our test . . . . . . . . . . . . . . . . 55
+      | 3.12 Recognition results for various office and auditorium scenes . . . . . . 61
+      | 3.13 A close-up office scene . . . . . . . . . . . . . . . . . . . . . . . . . . 62
+      | 3.14 Comparison with an indoor labeling system . . . . . . . . . . . . . . 63
+blank |
+text  | 4.1 A real-time guided scanning system . . . . . . . . . . . . . . . . . . . 65
+      | 4.2 Pipeline of the real-time guided scanning framework . . . . . . . . . . 69
+      | 4.3 Representative shape retrieval results . . . . . . . . . . . . . . . . . . 80
+      | 4.4 The proposed guided real-time scanning setup . . . . . . . . . . . . . 81
+      | 4.5 Retrieval results with simulated data using a chair data set . . . . . . 82
+      | 4.6 Retrieval results with simulated data using a couch data set . . . . . 83
+      | 4.7 Retrieval results with simulated data using a lamp data set . . . . . . 84
+      | 4.8 Retrieval results with simulated data using a table data set . . . . . . 85
+      | 4.9 Comparison between retrieval with view-dependent and merged scans 86
+      | 4.10 Effect of density-aware sampling . . . . . . . . . . . . . . . . . . . . . 87
+      | 4.11 Effect of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
+      | 4.12 Real-time retrieval results on various datasets . . . . . . . . . . . . . 88
+blank |
+      |
+      |
+      |
+meta  | xiii
+      | Chapter 1
+blank |
+title | Introduction
+blank |
+text  | Acquiring a 3-D model of a real-world object, also known as 3-D reconstruction
+      | technology, has long been a challenge for various applications, including robotics
+      | navigation, 3-D modeling of virtual worlds, augmented reality, computer graphics,
+      | and manufacturing. In the graphics community, a 3-D model is typically acquired in a
+      | carefully calibrated set-up with highly accurate laser scans, followed by a complicated
+      | off-line process from scan registration to surface reconstruction. Because this is a very
+      | long process that requires special equipment, only a limited number of objects can be
+      | modeled, and the method cannot be scaled to larger environments.
+      | One of the most common applications of large-scale 3-D reconstruction comes
+      | from modeling of urban environments. To build a model, a vehicle equipped with
+      | different sensors drives along roads and collects a large amount of data from lasers,
+      | GPS signals, wheel counters, cameras, etc. The data is then processed and stored in a
+      | compact form which includes important roads, buildings, and parking lots. The mapped
+      | environments are used frequently in cell-phone applications, mapping technology, or
+      | navigation tools.
+      | However, we cannot simply extend the same technology used in the 3-D reconstruc-
+      | tion of urban environments to indoor environments. First, unlike urban environments,
+      | where permanent roads exist, there are no clearly defined pathways that people must
+      | follow in an indoor environment. Occupants walk in various patterns around an in-
+      | door area, and often the space is cluttered, which could result in safety issues if, say,
+blank |
+      |
+meta  | 1
+      | CHAPTER 1. INTRODUCTION 2
+blank |
+      |
+      |
+text  | a robot with sensors drives within the area. Second, an indoor environment is not
+      | static. As residents and workers of the building engage in daily activities in interior
+      | environments, many objects are moved around or disappear, and new objects can be
+      | introduced. Third, interior shapes are much more complex compared to the outdoor
+      | surfaces of buildings, and it cannot simply be assumed that the objects present in a
+      | space are composed of flat surfaces as is generally the case in outdoor urban settings.
+      | Last, the modality of sensors used for outdoor mapping is not suitable for interior
+      | mapping and needs to be changed. A GPS signal does not work in indoor environ-
+      | ments, and the lighting conditions can vary significantly from one space to another
+      | compared to relatively constant sunlight outdoors.
+      | Yet, 3-D reconstruction of indoor environments also has a variety of potential
+      | applications. After a 3-D model of an indoor environment is acquired, the model
+      | could be used for interior design, indoor navigation, surveillance, or understanding
+      | the interior layouts and existence of objects in a space. Depending on the applications
+      | for which the reconstructed model would be used, the distance range and level of detail
+      | needed can vary as well.
+      | Recently, real-time 3-D sensors, such as RGB-D cameras, which are light-weight
+      | commodity devices, have been specifically designed to function in indoor environments
+      | and used to provide real-time 3-D data. Although the data captured from these sensors
+      | suffer from a limited field of view and complex noise characteristics, and therefore
+      | might not be suitable for accurate 3-D reconstruction, they can be used by everyday
+      | users to easily capture and utilize 3-D information of indoor environments. The work
+      | presented in this dissertation uses the data captured from RGB-D cameras with the
+      | goal of providing a useful 3-D acquisition while overcoming the limitations of the
+      | captured data. To do this, we have assumed different geometric priors depending on
+      | the targeted applications.
+      | In the remainder of this chapter, we first describe the characteristics of RGB-
+      | D camera sensors (Section 1.1). The subsequent section (Section 1.2) presents our
+      | approach to acquire 3-D indoor environments. The chapter concludes with an outline
+      | of the remainder of the dissertation (Section 1.3).
+meta  | CHAPTER 1. INTRODUCTION 3
+blank |
+      |
+      |
+title | 1.1 Background on RGB-D Cameras
+text  | Building a 3-D model of actual objects enables the real world to be connected to a
+      | virtual world. After obtaining a digital model from a real-world object, the model can
+      | be used in various applications. A benefit of 3D modeling is that the digital object
+      | can be saved and altered freely without an actual space being damaged or destroyed.
+      | Until recently, it was not possible for non-expert users to capture real-world envi-
+      | ronments in 3D because of the complexity and cost of the required equipment. RGB-D
+      | cameras, which provide real-time depth and color information, only became available
+      | a few years ago. The pioneering commodity product is the X-Box Kinect [Mic10],
+      | launched in November 2010. Originally developed as a gaming device, the sensor pro-
+      | vides real-time depth streams enabling interaction between a user and a system.
+      | The Kinect is affordable and easy to operate for non-expert users, and the pro-
+      | duced data can be accessed through open-source drivers. Although the main purpose
+      | of the Kinect thus far has been motion-sensing, providing a real-time interface for gam-
+      | ing or control, the device has served many purposes and has been used as a tool to
+      | develop personalized applications with the help of the drivers. Some developers also
+      | use the device to extend computer vision-related tasks (such as object recognition
+      | or structure from motion) but with depth measurements augmented as an additional
+      | modality of input. In addition, the device can also be viewed as a 3-D sensor that
+      | produces 3-D pointcloud data. In our work, this is how we view the device, and the
+      | goal of the research presented here, as noted above, was to acquire 3-D indoor objects
+      | or environments using the RGB-D cameras of the Kinect sensor.
+blank |
+      |
+title | 1.1.1 Technology
+text  | The underlying core technology of the depth-capturing capacity of the Kinect comes
+      | from its structured-light 3D scanner. This scanner measures the three-dimensional
+      | shape of an object using projected light patterns and a camera system. A typical
+      | scanner measuring assembly consists of one stripe projector and at least one camera.
+      | Projecting a narrow band of light onto a three-dimensionally shaped surface produces
+      | a line of illumination that appears distorted from perspectives other than that of the
+meta  | CHAPTER 1. INTRODUCTION 4
+blank |
+      |
+      |
+      |
+text  | Figure 1.1: Triangulation principle shown by one of multiple stripes (image from
+      | http://en.wikipedia.org/wiki/File:1-stripesx7.svg)
+blank |
+text  | projector, and this line can be used for an exact geometric reconstruction of the
+      | surface shape. A sample setup with the projected line pattern is shown in Figure 1.1.
+      | The displacement of the stripes can be converted into 3D coordinates, which allows
+      | any details on an object’s surface to be retrieved.
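Editor's note: the triangulation described above reduces, for a single stripe, to the classic depth-from-disparity relation. The sketch below is an illustration of that relation only, not code from the dissertation; the focal length, baseline, and disparity values are assumed Kinect-like numbers, not measured ones.

```python
# Illustrative sketch of the triangulation principle behind structured-light
# depth sensing. Assumed quantities (all hypothetical):
#   f - camera focal length in pixels
#   b - baseline between projector and camera in meters
#   d - observed displacement (disparity) of the projected pattern in pixels
def depth_from_disparity(f: float, b: float, d: float) -> float:
    """Depth of a surface point from the displacement of a projected stripe."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return f * b / d  # depth is inversely proportional to the observed displacement

# Example: f = 580 px, b = 7.5 cm, a 45 px displacement -> roughly 0.97 m
print(depth_from_disparity(580.0, 0.075, 45.0))
```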
+text  | An invisible structured-light scanner scans the 3-D shape of an object by projecting
+      | patterns with light in an invisible spectrum. The Kinect uses projected patterns
+      | composed of points of infrared (IR) light to generate video data in 3D. As shown in
+      | Figure 1.2, the Kinect is a horizontal bar with an IR light emitter and IR sensor. The
+      | IR emitter emits infrared light beams, and the IR sensor reads the IR beams reflected
+      | back to the sensor. The reflected beams are converted into depth information that
+      | measures the distance between an object and the sensor. This makes capturing a
+      | depth image possible. The color sensor captures normal video (visible light) that is
+      | synchronized with the depth data. The horizontal bar of the Kinect also contains
+      | microphone arrays and is connected to a small base by a tilt motor. While the color
+      | video and microphone provide additional means for a natural user interface, in this
+meta  | CHAPTER 1. INTRODUCTION 5
+blank |
+      |
+      |
+      |
+text  | Figure 1.2: Kinect sensor (left) and illustration of the integrated hardware (right).
+      | (images from http://i.msdn.microsoft.com/dynimg/IC568992.png and http://
+      | i.msdn.microsoft.com/dynimg/IC584396.png)
+blank |
+text  | dissertation, we are focused on the depth-sensing capability of the device.
+      | The Kinect has a limited working range, mainly designed for the volume that a
+      | person will require while playing a game. Kinect’s official documentation1 suggests
+      | a working range from 0.8 m to 4 m from the sensor. The sensor has an angular field
+      | of view of 57° horizontally and 43° vertically. When an object is out of range for
+      | a particular pixel, the system will return no values. The RGB video streams are
+      | produced in a 1280×960 resolution. However, the default RGB video stream uses 8-
+      | bit VGA resolution (640×480 pixels). The monochrome depth-sensing video stream
+      | is also in VGA resolution with 11-bit depth, which provides 2,048 levels of sensitivity.
+      | The depth and color streams are produced at a frame rate of 30 Hz.
+      | The depth data is originally produced as a 2-D grid of raw depth values. The
+      | values in each pixel can then be converted into (x, y, z) coordinates with calibration
+      | data. Depending on the application, the developer can regard the 2-D grid of values
+      | as a depth image, or the scattered points in 3-D ((x, y, z) coordinates) as unstructured
+      | pointcloud data.
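Editor's note: as a concrete illustration of the depth-grid-to-pointcloud conversion mentioned above, the following minimal sketch back-projects a depth image through a pinhole camera model. The intrinsics (fx, fy, cx, cy) are placeholder values for a generic 640×480 stream, not calibration data from the dissertation; real devices require per-unit calibration.

```python
import numpy as np

def depth_to_pointcloud(depth_m: np.ndarray,
                        fx: float = 525.0, fy: float = 525.0,
                        cx: float = 319.5, cy: float = 239.5) -> np.ndarray:
    """depth_m: (H, W) array of metric depth values; 0 marks missing data."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx            # back-project each pixel along its viewing ray
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels where the sensor returned no value
```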
+blank |
+      |
+title | 1.1.2 Noise Characteristics
+text  | While RGB-D cameras can provide real-time depth information, the obtained mea-
+      | surements exhibit convoluted noise characteristics. The measurements are extracted
+meta  | 1
+text  | http://msdn.microsoft.com/en-us/library/jj131033.aspx
+meta  | CHAPTER 1. INTRODUCTION 6
+blank |
+      |
+      |
+text  | from identification of corresponding points of infrared projections in image pixels,
+      | and there are multiple possible sources of errors: (i) calibration error – both the
+      | extrinsic calibration parameters, which are given as the displacement between the
+      | projector and cameras, and the intrinsic calibration parameters, which depend on
+      | the focal points and size of pixels on the sensor grid, vary for each product; (ii)
+      | distance-dependent quantization error – because the accuracy of measurements de-
+      | pends on the resolution of a pixel compared to the details of the projected pattern on
+      | the measured object, measurements are noisier for farther points, with more se-
+      | vere quantization artifacts; (iii) error from ambiguous or poor projection, in which
+      | the cameras cannot clearly observe the projected patterns – as the measurements are
+      | made by identifying the projected location of the infrared pattern, the distortion of
+      | the projected patterns on depth boundaries or on reflective material can result in
+      | wrong measurements. Sometimes the system cannot locate the corresponding points
+      | due to occlusion from parallax or the distance range, and the data is reported as missing.
+      | In short, the depth data exhibits highly non-linear noise characteristics, and it is very
+      | hard to model all of the noise analytically.
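Editor's note: to give the distance-dependent quantization error a concrete shape, a commonly used approximation for triangulation-based sensors models the random depth error as growing quadratically with range (e.g., Khoshelham and Elberink, 2012). The constant below is an assumed, device-dependent value for illustration, not a measurement from this work.

```python
# Minimal sketch of a quadratic range-noise model for a triangulation sensor.
# k is a hypothetical device constant; calibrate it per unit in practice.
def depth_noise_std(z_m: float, k: float = 1.425e-3) -> float:
    """Approximate standard deviation (m) of a depth measurement at range z_m."""
    return k * z_m ** 2  # noise grows roughly quadratically with distance

for z in (0.8, 2.0, 4.0):  # endpoints and midpoint of the stated working range
    print(f"z = {z} m -> sigma ~ {depth_noise_std(z) * 100:.2f} cm")
```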
+blank |
+      |
+title | 1.2 3-D Indoor Acquisition System
+text  | Given the complex noise characteristics of RGB-D cameras, we assumed that the de-
+      | vice produces noisy pointcloud data. Instead of reverse-engineering and correcting the
+      | noise from each source, we overcame the limitation on data by imposing assumptions
+      | on the 3-D shape of the objects being scanned.
+      | There are three possible ways to reconstruct 3-D models from noisy data. The first
+      | is to overcome the limitation of data by accumulating multiple frames from slightly dif-
+      | ferent viewpoints [IKH+ 11]. By averaging the noisy measurements and merging them
+      | into a single volumetric structure, a very high-quality mesh model can be recovered.
+      | The second is using a machine learning-based method. In this approach, multiple
+      | instances of measurements and actual object labels are first collected. Classifiers are
+      | then trained to produce the object labels given the measurements and later used to
+      | understand the given measurements. The third way is to assume geometric priors on
+meta  | CHAPTER 1. INTRODUCTION 7
+blank |
+      |
+      |
+text  | the data being captured. Assuming that the underlying scene is not completely ran-
+      | dom, the shape to be reconstructed has a limited degree of freedom, and can thus be
+      | reconstructed by inferring the most probable shape within the scope of the assumed
+      | structure.
+      | This third way is the method used in our work. By focusing on acquiring the pre-
+      | defined modes or degrees of freedom given the geometric priors, the acquired model
+      | naturally captures high-level information of the structure. In addition, the acquisition
+      | pipeline becomes lightweight and the entire process can stay real-time. Because the in-
+      | put data stream is also real-time, there is the possibility of incorporating user-interaction
+      | during the capturing process.
+blank |
+      |
+title | 1.3 Outline of the Dissertation
+text  | The chapters to follow, outlined below, discuss in detail the specific approaches we
+      | took to mitigate the problems inherent in indoor reconstruction from noisy sensor
+      | data.
+      | Chapter 2 discusses a pipeline used to acquire floor plans in residential areas. The
+      | proposed system is quick and convenient compared to the common pipeline used to
+      | acquire floor plans from manual sketching and measurements, which are frequently
+      | required for remodeling or selling a property. We posit that the world is composed of
+      | relatively large, flat surfaces that meet at right angles. We focus on continuous collec-
+      | tion of points that occupy large, flat areas and align with the axes, ignoring other
+      | points. Even with very noisy data, the process can be performed at an interactive
+      | rate since the space of possible plane arrangements is sparse given the measurements.
+      | We take advantage of real-time data and allow users to provide intuitive feedback
+      | to assist the acquisition pipeline. The research described in the chapter was first
+      | published as Y.M. Kim, J. Dolson, M. Sokolsky, V. Koltun, S. Thrun, Interactive
+      | Acquisition of Residential Floor Plans, IEEE International Conference on Robotics
+      | and Automation (ICRA), 2012, © 2012 IEEE, and the contents were also replicated
+      | with small modifications.
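Editor's note: to make the Manhattan-world prior described in the paragraph above concrete, once a global axis frame is fixed, wall candidates can be found by keeping only points whose normals align with one of the three axes and histogramming their offsets along that axis. This is a minimal illustrative sketch of that idea; the thresholds are assumptions, not the parameters actually used in Chapter 2.

```python
import numpy as np

def axis_aligned_planes(points, normals, angle_thresh=0.95,
                        bin_m=0.05, min_support=500):
    """points, normals: (N, 3) arrays in the Manhattan-aligned frame."""
    planes = []
    for axis in range(3):
        aligned = np.abs(normals[:, axis]) > angle_thresh  # near-axis normals only
        offsets = points[aligned, axis]
        if offsets.size == 0:
            continue
        # histogram plane offsets along the axis; peaks are candidate walls/floors
        bins = np.round(offsets / bin_m).astype(int)
        ids, counts = np.unique(bins, return_counts=True)
        for b, c in zip(ids, counts):
            if c >= min_support:
                planes.append((axis, b * bin_m, int(c)))  # (axis, offset, support)
    return planes
```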
+meta  | CHAPTER 1. INTRODUCTION 8
+blank |
+      |
+      |
+text  | Chapter 3 discusses how we targeted public spaces with many repeating ob-
+      | jects in different poses or variation modes. Even though indoor environments can
+      | frequently change, we can identify patterns and possible movements by reasoning
+      | at the object level. Especially in public buildings (offices, cafeterias, auditoriums, and
+      | seminar rooms), chairs, tables, monitors, etc., are repeatedly used in similar pat-
+      | terns. We first build abstract models of the objects of interest with simple geometric
+      | primitives and deformation modes. We then use the built models to quickly de-
+      | tect the objects of interest within an indoor scene in which the objects repeatedly ap-
+      | pear. While the models are simple approximations of actual complex geometry, we
+      | demonstrate that the models are sufficient to detect the objects within noisy, par-
+      | tial indoor scene data. The learned variability modes not only factor out nuisance
+      | modes of variability (e.g., motions of chairs, etc.) from meaningful changes (e.g.,
+      | security, where the new scene objects should be flagged), but also provide the func-
+      | tional modes of the object (the status of open drawers, a closed laptop, etc.), which
+      | potentially provide high-level understanding of the scene. The study discussed here
+      | first appeared as a publication, Young Min Kim, Niloy J. Mitra, Dong-Ming Yan,
+      | and Leonidas Guibas. 2012. Acquiring 3D indoor environments with variability and
+      | repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
+      | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157, from
+      | which the major written parts of the chapter were adapted.
+      | Chapter 4 discusses a reconstruction approach that utilizes 3-D models down-
+      | loaded from the web to assist in understanding the objects being scanned. The data
+      | stream from an RGB-D camera is noisy and exhibits lots of missing data, making it
+      | very hard to accurately build a full model of an object being scanned. We take an
+      | approach that uses a large database of 3-D models to match against partial, noisy scans
+      | of the input data stream. To this end, we propose a simple, efficient, yet discrimina-
+      | tive descriptor that can be evaluated in real-time and used to process complex indoor
+      | scenes. The matching models are quickly found in the database with the help of our
+      | proposed shape descriptor. This also allows real-time assessment of the quality of the
+      | data captured, and the system provides the user with real-time feedback on where to
+      | scan. Eventually the user can retrieve the closest model as quickly as possible during
+meta  | CHAPTER 1. INTRODUCTION 9
+blank |
+      |
+      |
+text  | the scanning session. The research and contents of the chapter will be published as
+      | Y.M. Kim, N. Mitra, Q. Huang, L. Guibas, Guided Real-Time Scanning of Indoor
+      | Environments, Pacific Graphics 2013.
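Editor's note: the retrieval pattern described in the paragraph above, summarizing a partial scan with a lightweight descriptor and then finding the nearest database model, can be sketched as follows. The dissertation's A2h descriptor is not specified in this excerpt, so the sketch substitutes the classic D2 shape distribution (a histogram of pairwise point distances) as a clearly labeled stand-in; only the describe-then-match pattern is the point.

```python
import numpy as np

def d2_descriptor(points: np.ndarray, n_pairs: int = 20000,
                  bins: int = 64) -> np.ndarray:
    """D2 stand-in descriptor: histogram of distances between random point pairs."""
    rng = np.random.default_rng(0)
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(d, bins=bins, range=(0.0, d.max() + 1e-9))
    return hist / hist.sum()  # normalize so scans of different sizes are comparable

def retrieve(scan_desc: np.ndarray, database: dict) -> str:
    """Return the database model whose descriptor is closest in L2 distance."""
    return min(database, key=lambda name: np.linalg.norm(database[name] - scan_desc))
```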
+      | Chapter 5 concludes the dissertation with a summary of our work and a discussion
+      | of future directions this research could take.
+blank |
+      |
+title | 1.3.1 Contributions
+text  | The major contribution of the dissertation is to present methods to quickly acquire
+      | 3-D information from noisy, occluded pointcloud data by assuming geometric pri-
+      | ors. The pre-defined modes not only provide high-level understanding of the current
+      | mode, but also allow the data size to stay compact, which, in turn, saves memory
+      | and processing time. The proposed geometric priors have been previously used in
+      | different settings, but our approach incorporates priors tuned for the practical
+      | tasks at hand with real scans from RGB-D data acquired from actual environments.
+      | The example geometric priors that are covered are as follows:
+blank |
+text  | • Based on the Manhattan world assumption, important architectural elements (walls,
+      | floor, and ceiling) can be retrieved in real-time.
+blank |
+text  | • By building an abstract model composed of simple geometric primitives and joint
+      | information between primitives, objects under severe occlusion and in different
+      | configurations can be located. The bottom-up approach can quickly populate
+      | large indoor environments with variability and repetition (around 200 ms per
+      | object).
+blank |
+text  | • Online public databases of 3-D models recover the structure of objects from
+      | partial, noisy scans in a matter of seconds. We developed a relation-based
+      | lightweight descriptor for fast and accurate model retrieval and positioning.
+blank |
+text  | We also take advantage of the representation and demonstrate a quick and effi-
+      | cient pipeline, including user-interaction when possible. More specifically, we demon-
+      | strate the following novel prototypes of systems:
+meta  | CHAPTER 1. INTRODUCTION 10
+blank |
+      |
+      |
+text  | • A new hand-held system with which a user can capture a space and automatically
+      | generate a floor plan. The user does not have to measure distances or manually
+      | sketch the layout.
+blank |
+text  | • A projector attached to the RGB-D camera that communicates the current status of
+      | the acquisition on the physical surface to the user, thus allowing the user to provide
+      | intuitive feedback.
+blank |
+text  | • A real-time guided scanning setup for online quality assessment of streaming
+      | RGB-D data, obtained with the help of a 3-D database of models.
+blank |
+text  | While the specific geometric priors and prototypes listed above come from an under-
+      | standing of the characteristics of the task at hand, the underlying assumptions and
+      | approach provide a direction to allow everyday users to acquire useful 3-D information
+      | in the years to come as real-time 3-D scans become available.
+meta  | Chapter 2
+blank |
+title | Interactive Acquisition of
+      | Residential Floor Plans1
+blank |
+text  | Acquiring an accurate floor plan of a residence is a challenging task, yet one that
+      | is required for many situations, such as remodeling or sale of a property. Original
+      | blueprints can be difficult to find, especially for older residences. In practice, contrac-
+      | tors and interior designers use point-to-point laser measurement devices to acquire
+      | a set of distance measurements. Based on these measurements, an expert creates a
+      | floor plan that respects the measurements and represents the layout of the residence.
+      | Both taking measurements and representing the layout are cumbersome manual tasks
+      | that require experience and time.
+      | In this chapter, we present a hand-held system for indoor architectural reconstruc-
+      | tion. This system eliminates the manual post-processing necessary for reconstructing
+      | the layout of walls in a residence. Instead, an operator with no architectural exper-
+      | tise can interactively guide the reconstruction process by moving freely through an
+meta  | 1
+text  | The contents of the chapter were originally published as Y.M. Kim, J. Dolson, M. Sokolsky, V.
+      | Koltun, S. Thrun, Interactive Acquisition of Residential Floor Plans, IEEE International Conference
+      | on Robotics and Automation (ICRA), 2012. © 2012 IEEE.
+      | In reference to IEEE copyrighted material which is used with permission in this thesis, the
+      | IEEE does not endorse any of Stanford University’s products or services. Internal or personal
+      | use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material
+      | for advertising or promotional purposes or for creating new collective works for resale or redis-
+      | tribution, please go to http://www.ieee.org/publications_standards/publications/rights/
+      | rights_link.html to learn how to obtain a License from RightsLink.
+blank |
+      |
+meta  | 11
+      | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 12
+blank |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+      |
+text  | Figure 2.1: Our hand-held system is composed of a projector, a Microsoft Kinect
+      | sensor, and an input button (left). The system uses augmented reality feedback
+      | (middle left) to project the status of the current model onto the environment and to
+      | enable real-time acquisition of residential wall layouts (middle right). The floor plan
+      | (middle right) and visualization (right) were generated using data captured by our
+      | system.
+blank |
+text  | interior with the hand-held system until all walls have been observed by the sensor
+      | in the system.
+      | Our system is composed of a laptop connected to an RGB-D camera, a lightweight
+      | optical projector, and an input button interface (Figure 2.1, left). The RGB-D cam-
+      | era is a real-time depth sensor that acts as the main input modality. As noted in
+      | Chapter 1, we use the Microsoft Kinect, a lightweight commodity device that out-
+      | puts VGA-resolution range and color images at video rates. The data is processed
+      | in real time to create the floor plan by focusing on large flat surfaces and ignoring
+      | clutter. The generated floor plan can be used directly for remodeling or real-estate
+      | applications or to produce a 3D model of the interior for applications in virtual envi-
+      | ronments. In Section 2.4, we present and discuss a number of residential wall layouts
+      | reconstructed with our system, captured from actual apartments. Even though the
+      | results presented here focus on residential spaces, the system can also
+      | be used in other types of interior environments.
+      | The attached projector is initially calibrated to have an overlapping field of view
+      | with the same image center as the depth sensor. It projects the reconstruction status
+      | onto the surface being scanned. Under normal lighting, the projector does not provide
+      | a sophisticated rendering. Rather, the projection allows the user to visualize the
+      | reconstruction process. The user can then detect reconstruction errors that arise due
+      | to deficiencies in the data capture path and can complete missing data in response.
+meta  | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 13
+blank |
+      |
+      |
+text  | The user can also note which walls have been included in the model and easily resolve
+      | ambiguities with a simple input device. The proposed system has advantages over
+      | previous applications by allowing a new type of user interaction in real time that
+      | focuses only on architectural elements relevant to the task at hand. This difference
+      | is discussed in detail in the following section.
+blank |
+      |
+title | 2.1 Related Work
+text  | A number of approaches have been proposed for indoor reconstruction in computer
+      | graphics, computer vision, and robotics. Real-time indoor reconstruction using either
+      | a depth sensor [HKH+ 12] or an optical camera [ND10] has been recently explored.
+      | The results of these studies suggest that the key to real-time performance is the
+      | fast registration of successive frames. Similar to [HKH+ 12], we fuse both color and
+      | depth information to register frames. Furthermore, our approach extends real-time
+      | acquisition and reconstruction by allowing the operator to visualize the current re-
+      | construction status without consulting a computer screen. Because the feedback loop
+      | in our system is immediate, the operator can resolve failures and ambiguities while
+      | the acquisition session is in progress.
+      | Previous approaches have also been limited to a dense 3-D reconstruction (reg-
+      | istration of point cloud data) with no higher-level information, which is memory
+      | intensive. A few exceptions include [GCCMC08], by means of which high-level fea-
+      | tures (lines and planes) are detected to reduce complexity and noise. The high-level
+      | structures, however, do not necessarily correspond to actual architectural elements,
+      | such as walls, floors, or ceilings. In contrast, our system identifies and focuses on
+      | significant architectural elements using the Manhattan-world assumption, which is
+      | based on the observation that many indoor scenes are largely rectilinear [CY99]. This
+      | assumption is widely made for indoor scene reconstruction from images to overcome
+      | the inherent limitations of image data [FCSS09][VAB10]. While the traditional stereo
+      | method only reconstructs 3-D locations of image feature points, the Manhattan-world
+      | assumption successfully fills the area between the sparse feature points during post-
+      | processing. Our system, based on the Manhattan-world assumption, differentiates
+meta  | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 14
|
783
|
+
blank |
|
784
|
+
|
|
785
|
+
|
|
786
|
+
text | between architectural features and miscellaneous objects in the space, producing a
|
787
|
+
| clean architectural floor plan and simplifying the representation of the environment.
|
788
|
+
| Even with the Manhattan-world assumption, however, the system still cannot fully
|
789
|
+
| resolve ambiguities introduced by large furniture items and irregular features in the
|
790
|
+
| space without user input. The interactive capability offered by our system allows the
|
791
|
+
| user to easily disambiguate the situation and integrate new input into a global map
|
792
|
+
| of the space in real time.
|
793
|
+
| Not only does our system simplify the representation of the features of a space, but by doing so it also reduces the computational burden of processing a map. Employing the Manhattan-world assumption simplifies the map construction to a one-dimensional, closed-form problem. Registration of successive point clouds results in an accumulation of errors, especially for a large environment, and requires a global optimization step in order to build a consistent map. This is similar to reconstruction tasks encountered in robotic mapping. In other approaches, the problem is usually solved by bundle adjustment, a costly off-line process [TMHF00][Thr02].
| The augmented reality component of our system is inspired by the SixthSense project [MM09]. Instead of simply augmenting a user's view of the world, however, our projected output serves to guide an interactive reconstruction process. Directing the user in this way is similar to re-photography [BAD10], where a user is guided to capture a photograph from the same viewpoint as in a previous photograph. By using a micro-projector as the output modality, our system allows the operator to focus on interacting with the environment.
blank |
title | 2.2 System Overview and Usage
text | The data acquisition process is initiated by the user pointing the sensor at a corner where three mutually orthogonal planes meet. This corner defines the Manhattan-world coordinate system. The attached projector indicates successful initialization by overlaying blue-colored planes with white edges onto the scene (Figure 2.2 (a)). After the initialization, the user scans each room individually as he or she loops around in it holding the device.
blank |
text | [Figure 2.2 flow diagram. Pipeline: Initialization -> Fetch a new frame -> Pair-wise registration; on Success -> Plane extraction (Exists -> Global adjustment; New -> Map update); on Failure -> Visual feedback -> Adjust data path. User interaction: left click -> Select planes; right click -> Start a new room. Panels (a), (b), (c).]
blank |
text | Figure 2.2: System overview and usage. When an acquisition session is initiated by observing a corner, the user is notified by a blue projection (a). After the initialization, the system updates the camera pose by registering consecutive frames. If a registration failure occurs, the user is notified by a red projection and is required to adjust the data capture path (b). Otherwise, the updated camera configuration is used to detect planes that satisfy the Manhattan-world assumption in the environment and to integrate them into the global map. The user interacts with the system by selecting planes in the space (c). When the acquisition session is completed, the acquired map is used to construct a floor plan consisting of user-selected planes.
blank |
text | If the movement is too fast or if there are not enough features, a red projection on the surface guides the user to recover the position of the device (Figure 2.2 (b)) and re-acquire that area.
| The system extracts flat surfaces that align with the Manhattan coordinate system and creates complete rectilinear polygons, even when the connectivity between planes is occluded. At times, the user might not want some of the extracted planes (parts of furniture or open doors) to be included in the model, even if these planes satisfy the Manhattan-world assumption. In these cases, when the user clicks the input button (left click), the extracted wall toggles between inclusion in (indicated in blue) and exclusion from (indicated in grey) the model (Figure 2.2 (c)). As the user finishes scanning a room, he or she can move to another room and scan it. A new rectilinear polygon is initiated by a right click. Another rectilinear polygon is similarly created by including the selected planes, and the room is correctly positioned in the global coordinate system. The model is updated in real time and stored in either a CAD format or a 3-D mesh format that can be loaded into most 3-D modeling software.
blank |
title | 2.3 Data Acquisition Process
text | Some notation used throughout this section is introduced in Figure 2.3. At each time step t, the sensor produces a new frame of data, F^t = {X^t, I^t}, composed of a range image X^t (a 2-D array of depth measurements) and a color image I^t, Figure 2.3(a). T^t represents the transformation from the frame F^t, measured from the current sensor position, to the global coordinate system, which is where the map M^t = {L_r^t, R_r^t} is defined, Figure 2.3(b). Throughout the data capture session, the system maintains the global map M^t and the two most recent frames, F^{t-1} and F^t, to update the transformation information. Instead of storing information from all frames, the system keeps the total computational and memory requirements minimal by incrementally updating the global map only with components that need to be added to the final model. Additionally, the frame with the last observed corner, F^c, is stored to recover the sensor position when lost.
blank |
text | [Figure 2.3 diagram. Panels: (a) the frame F^t with range image X^t, color image I^t, and transformation T^t(F^t); (b) the plane list L_r^t with planes P0-P8 (solid: selected, dotted: ignored); (c) the plane-label array P^t; (d) the room loop R_r^t.]
blank |
text | Figure 2.3: Notation and representation. Each frame of the sensor F^t is composed of a 2-D array of depth measurements X^t and a color image I^t (a). The global map M^t is composed of a sequence of observed planes L_r^t (b) and loops of rectilinear polygons built from the planes R_r^t (d). After the registration T^t of the current frame is found with respect to the global coordinate system, planes are extracted as P^t (c), and the system automatically updates the room structure R_r^t based on the observation (d).
blank |
text | After the transformation is found, the relationship between the planes in the global map M^t and the measurement in the current frame X^t is represented as P^t, a 2-D array of plane labels for each pixel, Figure 2.3(c). The map M^t is composed of lists of observed axis-parallel planes L_r^t and loops of the current room structure R_r^t, defined with subsets of the planes from L_r^t. Each plane has its axis label (x, y, or z) and offset value (e.g., x = x0), as well as its left or right plane if the connectivity is observed. A plane can be selected (shown as a solid line in Figure 2.3(b)) or ignored (dotted line in Figure 2.3(b)) based on user input. The selected planes are extracted from L_r^t as the loop of the room R_r^t, which can be converted into the floor plan as a 2-D rectilinear polygon. To have a fully connected rectilinear polygon per room, R_r^t is constrained to have alternating axis labels (x and y). For the z direction (the vertical direction), the system retains only the ceiling and the floor. The system also keeps the sequence of observations (S^x, S^y, and S^z) of offset values for each axis direction, and stores the measured distance and the uncertainty of the measurement between planes.
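text | To make the representation concrete, the following is a minimal sketch of one way the map state described above could be organized. It is not the authors' implementation; all type and field names are our own illustration:
blank |
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Plane:
        axis: str                     # 'x', 'y', or 'z': direction of the plane normal
        offset: float                 # offset value, e.g. x = x0 for an x-axis plane
        left: Optional[int] = None    # index of the left neighbor plane, if observed
        right: Optional[int] = None   # index of the right neighbor plane, if observed
        selected: bool = True         # user toggle: include in the final floor plan

    @dataclass
    class GlobalMap:
        planes: List[Plane]           # L_r^t: observed axis-parallel planes
        rooms: List[List[int]]        # R_r^t: per-room loops of plane indices
        offsets_seen: dict            # S^x, S^y, S^z: observation sequences per axis
blank |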
| The overall reconstruction process is summarized in Figure 2.2. As mentioned in Sec. 2.2, this process is initiated by extracting three mutually orthogonal planes when a user points the system at one of the corners of a room. To detect planes in the range data, our system fits plane equations to groups of range points and their corresponding normals using the RANSAC algorithm [FB81]: the system first randomly samples a few points, then fits a plane equation to them. The system then tests the detected plane by counting the number of points that can be explained by the plane equation. After convergence, the detected plane is classified as valid only if the detected points constitute a large, connected portion of the depth information within the frame. If three planes are detected and they are mutually orthogonal, our system assigns the x, y, and z axes to be the normal directions of these three planes, which form the right-handed coordinate system for our Manhattan world. Now the map M^t has two planes (the floor or ceiling is excluded), and the transformation T^t between M^t and F^t is also found.
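text | As a reference for the plane-fitting step, here is a minimal RANSAC sketch in the spirit of [FB81]. The iteration count, inlier tolerance, and inlier threshold are illustrative assumptions, and the connected-component validity test described above is omitted:
blank |
    import numpy as np

    def ransac_plane(points, iters=200, tol=0.02, min_inliers=2000):
        # points: (N, 3) range points. Returns (normal, d) with n . p = d,
        # or None if no plane explains enough of the frame.
        # iters, tol, min_inliers are assumed tuning parameters.
        best, best_count = None, min_inliers
        rng = np.random.default_rng()
        for _ in range(iters):
            p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            if np.linalg.norm(n) < 1e-9:
                continue                      # degenerate (collinear) sample
            n = n / np.linalg.norm(n)
            d = n @ p0
            count = np.count_nonzero(np.abs(points @ n - d) < tol)
            if count > best_count:            # keep the plane explaining most points
                best, best_count = (n, d), count
        return best
blank |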
| A new measurement F^t is registered with the previous frame F^{t-1} by aligning depth and color features (Sec. 2.3.1). This registration is used to update T^{t-1} to a new transformation T^t. The system extracts planes that satisfy the Manhattan-world assumption from T^t(F^t) (Sec. 2.3.2). If the extracted planes already exist in L_r^t, the current measurement is compared with the global map and the registration is refined (Sec. 2.3.3).
blank |
text | Figure 2.4: (a) Flat wall features (depicted by the triangle and circle) are observed from two different locations. Diagram (b) shows both observations with respect to the camera coordinate system. Without features, using projection-based ICP can lead to registration errors in the image-plane direction (c), while the use of features provides better registration (d).
blank |
text | If there is a new plane extracted, or if there is user input to specify the map structure, the map is updated accordingly (Sec. 2.3.4).
blank |
title | 2.3.1 Pair-Wise Registration
text | To propagate information from previous frames and to detect new planes in the scene, each incoming frame must be registered with respect to the global coordinate system. To start this process, the system finds the relative registration between the two most recent frames, F^{t-1} and F^t. By using both the depth point clouds (X^{t-1}, X^t) and optical images (I^{t-1}, I^t), the system can efficiently register frames in real time (about 15 fps).
| Given two sets of point clouds, X^{t-1} = {x_i^{t-1}}_{i=1}^{N} and X^t = {x_i^t}_{i=1}^{N}, and the transformation for the previous point cloud T^{t-1}, the correct rigid transformation T^t will minimize the error between correspondences in the two sets:
blank |
text | \min_{y_i, T^t} \sum_i \left\| w_i \left( T^{t-1}(x_i^{t-1}) - T^t(y_i^t) \right) \right\|^2 \tag{2.1}
blank |
text | Here y_i^t ∈ X^t is the corresponding point for x_i^{t-1} ∈ X^{t-1}. Once the correspondence is known, minimizing Eq. (2.1) has a closed-form solution [BM92]. In conventional approaches, correspondence is found by searching for the closest point, which is computationally expensive.
blank |
text | [Figure 2.5 panels: (a) i^{t-1} ∈ I^{t-1}; (b) j^t ∈ I^t; (c) H^t(I^{t-1}); (d) |I^t - H^t(I^{t-1})|]
blank |
text | Figure 2.5: From the optical flow between two consecutive frames, sparse image features are matched between (a) i^{t-1} ∈ I^{t-1} and (b) j^t ∈ I^t. The matched features are then used to calculate a homography H^t such that the previous image I^{t-1} can be warped into the space of the current image I^t to create dense projective correspondences (c). The difference image (d) shows that most of the dense correspondences are within a few pixels of error in the image plane, with a slight offset around silhouette areas.
blank |
text | Real-time registration methods reduce the cost by projecting the 3-D points onto a 2-D image plane and assigning correspondences to points that project onto the same pixel locations [RL01]. However, projection will only reduce the distance in the ray direction; the offset parallel to the image plane cannot be adjusted. This phenomenon can result in the algorithm not compensating for the translation parallel to the plane and therefore shrinking the size of the room (Figure 2.4).
| Our pair-wise registration is similar to [RL01], but it compensates for the displacement parallel to the image plane using image features and silhouette points. Intuitively, the system uses a homography to compensate for errors parallel to the plane where the structure can be approximated by a plane, and silhouette points are used to compensate for the remaining errors where the features are not planar.
| Our system first computes the optical flow between the color images I^t and I^{t-1} and finds a sparse set of features matched between them, Figure 2.5(a)(b). The sparse set of features can then be used to create dense projective correspondences between the two frames, Figure 2.5(c)(d). More specifically, a homography is a transform between 2-D homogeneous coordinates defined by a matrix H ∈ R^{3x3}:
blank |
text | \min_{H} \sum_{i^{t-1}, j^t} \left\| H i^{t-1} - j^t \right\|^2, \quad \text{where } i^{t-1} = (u_i, v_i, 1)^\top \in I^{t-1}, \; j^t = (w u_j, w v_j, w)^\top \in I^t \tag{2.2}
blank |
text | Figure 2.6: Silhouette points. There are two different types of depth discontinuity: the boundaries of a shadow made on the background by a foreground object (empty circles), and the boundaries of a foreground object (filled circles). The meaningful depth features are the foreground points, which are the silhouette points used for our registration pipeline.
blank |
text | Compared to the naive projective correspondence used in [RL01], a homography defines a map between two planar surfaces in 3-D space. The homography represents the displacement parallel to the image plane, and is used to compute dense correspondences between the two frames. While a homography does not represent a full transformation in 3-D, the planar approximation works well in practice for our scenario, where the scene is mostly composed of flat planes and the relative movement is small. From the second iteration, the correspondence is found by projecting individual points onto the image plane, as shown in [RL01].
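text | A minimal sketch of this feature-matching and homography step using OpenCV follows; the corner count, flow method, and RANSAC threshold are illustrative choices, not the system's exact configuration:
blank |
    import cv2

    def dense_correspondence_via_homography(img_prev, img_cur):
        # img_prev, img_cur: consecutive grayscale frames I^{t-1}, I^t.
        # Sparse features in the previous frame (Shi-Tomasi corners).
        pts_prev = cv2.goodFeaturesToTrack(img_prev, maxCorners=500,
                                           qualityLevel=0.01, minDistance=7)
        # Track them into the current frame with pyramidal Lucas-Kanade flow.
        pts_cur, status, _ = cv2.calcOpticalFlowPyrLK(img_prev, img_cur,
                                                      pts_prev, None)
        good = status.ravel() == 1
        # Fit a single homography H (Eq. 2.2) to the matches, robust to outliers.
        H, _ = cv2.findHomography(pts_prev[good], pts_cur[good], cv2.RANSAC, 3.0)
        return H  # warps img_prev pixels into img_cur: dense correspondence
blank |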
| Given the correspondence, the registration between the frames for the current iteration can be given as a closed-form solution (Equation 2.1). Additionally, the system modifies the correspondence for silhouette points (points of depth discontinuity in the foreground, shown in Figure 2.6). For silhouette points in X^{t-1}, the system finds the closest silhouette points in X^t within a small search window around the original corresponding location. If a matching silhouette point exists, the correspondence is weighted more heavily. (We used w_i = 100 for silhouette points and w_i = 1 for non-silhouette points.) The process iterates until it converges.
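text | For reference, here is a sketch of the weighted closed-form alignment of Eq. (2.1) in the style of [BM92], using the SVD of the weighted cross-covariance. This is the standard construction rather than the authors' exact code:
blank |
    import numpy as np

    def weighted_rigid_transform(src, dst, w):
        # src, dst: (N, 3) corresponding 3-D points; w: (N,) weights
        # (e.g. 100 for silhouette points, 1 otherwise). Returns R, t
        # minimizing sum_i w_i || R src_i + t - dst_i ||^2.
        w = w / w.sum()
        mu_s = (w[:, None] * src).sum(axis=0)             # weighted centroids
        mu_d = (w[:, None] * dst).sum(axis=0)
        S = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)  # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(S)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
        R = Vt.T @ D @ U.T
        t = mu_d - R @ mu_s
        return R, t
blank |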
blank |
title | Registration Failure
blank |
text | Real-time registration is a crucial part of our algorithm for accurate reconstruction. Even with the hybrid approach in which both color and depth features are used, the registration can fail, and it is important to detect the failure immediately and to recover the position of the sensor.
| A registration failure is detected either (1) if the pair-wise registration does not converge, or (2) if there are not enough color and depth features. The first case can easily be detected as the algorithm runs. The second case is detected if the optical flow could not produce a homography (i.e., there is a lack of color features) and there were not enough matched silhouette points (i.e., there is a lack of depth features).
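text | The two failure conditions reduce to a simple predicate; the sketch below is an illustration, and the minimum match count is an assumed tuning parameter:
blank |
    def registration_failed(converged, H, n_silhouette_matches,
                            min_silhouette_matches=20):  # assumed threshold
        # Case (1): the iterative pair-wise registration did not converge.
        if not converged:
            return True
        # Case (2): no homography from optical flow (too few color features)
        # AND too few matched silhouette points (too few depth features).
        if H is None and n_silhouette_matches < min_silhouette_matches:
            return True
        return False
blank |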
| In cases of registration failure, the projected image turns red, indicating that the user should return the system's viewpoint to the most recently observed corner. This movement usually requires only a small amount of back-tracking, because the failure is detected within milliseconds of leaving the previously registered area. Similar to the initialization step, the system extracts planes from X^t using RANSAC and matches the planes with the desired corner. Figure 2.2 (b) depicts the process of overcoming a registration failure. The user then deliberately moves the sensor along a path with richer features, or steps farther from a wall to cover a wider view.
blank |
title | 2.3.2 Plane Extraction
text | Based on the transformation T^t, the system extracts axis-aligned planes and associated edges. The planes and detected features provide higher-level information that relates the raw point cloud X^t to the global map M^t. Because the system only considers planes aligned with the Manhattan-world coordinate system, we were able to simplify the plane detection procedure.
| The planes from the previous frame that remain visible can easily be found by using the correspondence. From the pair-wise registration (Sec. 2.3.1), our system has the point-wise correspondence between the previous frame and the current frame. The plane label P^{t-1} from the previous frame is updated simply by being copied over to the corresponding location. Then, the system refines P^t by alternating between fitting points and fitting parameters.
| A new plane can be found by projecting the remaining points onto the x, y, and z axes. For each axis direction, a histogram is built with a bin size of 20 cm. The system then tests the plane equation for populated bins. Compared to the RANSAC procedure used for initialization, the Manhattan-world assumption reduces the number of degrees of freedom from three to one, making plane extraction more efficient.
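text | A sketch of this histogram test for one axis is given below; the minimum bin population is an assumed parameter, and the subsequent plane-equation verification is only hinted at by returning the mean offset per populated bin:
blank |
    import numpy as np

    def detect_axis_planes(points, axis, bin_size=0.2, min_points=500):
        # points: (N, 3) array in the Manhattan-world frame; axis in {0, 1, 2}.
        # Histogram the coordinate along the axis (bin size 0.2 m, as in the text);
        # min_points is an assumed population threshold.
        coords = points[:, axis]
        edges = np.arange(coords.min(), coords.max() + bin_size, bin_size)
        counts, _ = np.histogram(coords, bins=edges)
        planes = []
        for b in np.nonzero(counts >= min_points)[0]:
            members = coords[(coords >= edges[b]) & (coords < edges[b + 1])]
            planes.append(members.mean())   # candidate offset, e.g. x = x0
        return planes  # offsets of candidate axis-aligned planes
blank |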
| For extracted planes, the boundary edges are also extracted; the system detects groups of boundary points that can be explained by an axis-parallel line segment. The system also retains information about the relative positions of extracted planes (left/right). As long as the sensor is not flipped upside-down, this information provides an important cue for building a room with the correct topology, even when the connectivity between neighboring planes has not been observed.
blank |
title | Data Association
blank |
text | After the planes are extracted, the data association process finds the link between the global map M^t and the extracted planes P^t, the 2-D array of plane labels for each pixel. The system automatically finds plane labels that existed in the previous frame and extracts those planes by copying over the plane labels using the correspondences.
| The plane label for a newly detected plane can be found by comparing T^t(F^t) and M^t. In addition to the plane equation, the relative position of the newly observed plane with respect to other observed planes is used to label the plane. If the plane has not been previously observed, a new plane is added into L_r^t based on the left-right information.
| After the data association step, the system updates the sequence of observations S. The planes that have been assigned as previously observed are used for global adjustment (Sec. 2.3.3). If a new plane is observed, the room R_r^t is updated accordingly (Sec. 2.3.4).
blank |
title | 2.3.3 Global Adjustment
text | Due to noise in the point cloud, frame-to-frame registration is not perfect, and error accumulates over time. This is a common problem in pose estimation. Large-scale localization approaches use bundle adjustment to compensate for error accumulation [TMHF00, Thr02]. Enforcing this global constraint involves detecting landmark objects, or stationary objects observed at different times during a sequence of measurements. Usually this global adjustment becomes an optimization problem in many dimensions.
blank |
text | Figure 2.7: As errors accumulate in T^t and in the measurements, the map M^t becomes inconsistent. By comparing previous and recent measurements, the system can correct for the inconsistency and update the value of c such that c = a.
blank |
text | The problem is formulated by constraining the landmarks to predefined global locations, and by solving an energy function that encodes noise in the pose estimation of both the sensor and landmark locations. The Manhattan-world assumption allows us to reduce the error accumulation efficiently in real time by refining our registration estimate and by optimizing the global map.
blank |
title | Refining the Registration
blank |
text | After data association, the system performs a second round of registration with respect to the global map M^t to reduce the error accumulated in T^t by incremental, pair-wise registration. The extracted planes P^t, if already observed by the system, have been assigned to planes in M^t that have associated plane equations. For example, suppose a point T^t(x_{u,v}) = (x, y, z) has a plane label P^t(u, v) = p_k (assigned to plane k). If plane k has a normal parallel to the x axis, the plane equation in the global map M^t can be written as x = x0 (x0 ∈ R). Consequently, the registration should be refined to minimize ||x - x0||^2. In other words, the refined registration can be found by defining the corresponding point for x_{u,v} as (x0, y, z). Corresponding points are likewise assigned for every point with a plane assignment in P^t. Given the correspondence, the system can refine the registration between the current frame F^t and the global map M^t. This second round of registration reduces the error in the axis direction. In our example, the refinement is active while the plane x = x0 is visible and reduces the uncertainty in the x direction with respect to the global map; the error in the x direction does not accumulate during that interval.
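text | The plane-induced correspondences can be written down directly; the sketch below builds the target points that would feed the closed-form alignment shown earlier. Array layout and names are our own illustration:
blank |
    def plane_constrained_targets(points, labels, plane_axis, plane_offset):
        # points: (N, 3) globally transformed points T^t(x_{u,v});
        # labels[i]: global-map plane id for point i (or -1 if unassigned);
        # plane_axis[k] in {0, 1, 2} and plane_offset[k] give plane k's
        # equation, e.g. axis 0 with offset x0 encodes x = x0.
        targets = points.copy()
        for i, k in enumerate(labels):
            if k < 0:
                continue
            # Snap the point onto its assigned plane: the corresponding point
            # for (x, y, z) on plane x = x0 is (x0, y, z).
            targets[i, plane_axis[k]] = plane_offset[k]
        return targets  # pair (points, targets) for the weighted alignment
blank |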
blank |
title | Optimizing the Map
blank |
text | As error accumulates, the reconstructed map M^t may also require global adjustment in each axis direction. The Manhattan-world assumption simplifies this global optimization into two separate, one-dimensional problems (we exclude the z direction for now, but the idea can be extended to the 3-D case).
| Figure 2.7 shows a simple example in the x-axis direction. Let us assume that the figure represents an overhead view of a rectangular room. There should be two walls whose normals are parallel to the x-axis. The sensor detects the first wall (x = a), sweeps around the room, observes another wall (x = b), and returns to the previously observed wall. Because of error accumulation, parts of the same wall have two different offset values (x = a and x = c), but by observing the left-right relationship between walls, the system infers that the two walls are indeed the same wall.
| To optimize the offset values, the system tracks the sequence of observations S^x = {a, b, c} and the variances at the point of observation for each wall, as well as the constraints represented by pairs of identical offset values, C^x = {(c_11, c_12) = (a, c)}. We introduce two random variables, Δ_1 and Δ_2, to constrain the global map optimization. Δ_1 is a random variable with mean m_1 = b - a and variance σ_1^2 that represents the error between the moment when the sensor observes the x = a wall and the moment it observes the x = b wall. Likewise, the random variable Δ_2 represents the error with mean m_2 = c - b and variance σ_2^2.
| Whenever a new constraint is added, or when the system observes a plane that was previously observed, the global adjustment routine is triggered. This usually happens when the user finishes scanning a room by looping around it and returning to the first wall measured. By confining the axis direction, the global adjustment becomes a one-dimensional quadratic optimization:
blank |
text | \min_{S^x} \sum_i \frac{\left\| \Delta_i - m_i \right\|^2}{\sigma_i^2} \quad \text{s.t. } c_{j1} = c_{j2}, \; \forall (c_{j1}, c_{j2}) \in C^x \tag{2.3}
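text | Because the constraints simply merge offsets, Eq. (2.3) reduces to a small weighted linear least-squares problem. The following numpy sketch illustrates one way to solve it, fixing the gauge by anchoring the first wall; this is our illustration, not the published implementation:
blank |
    import numpy as np

    def adjust_offsets(offsets, means, sigmas, merge):
        # offsets: observed offsets along one axis, e.g. S^x = [a, b, c].
        # means[i], sigmas[i]: mean m_i and std sigma_i of the error Delta_i
        #   between observation i and observation i+1 (m_1 = b - a, ...).
        # merge: index pairs constrained to the same wall, e.g. [(0, 2)]
        #   encodes c = a in C^x. (Single-level merges only, for brevity.)
        n = len(offsets)
        group = list(range(n))
        for j1, j2 in merge:
            group[j2] = group[j1]                # collapse constrained indices
        ids = sorted(set(group))
        col = {g: k for k, g in enumerate(ids)}  # one free variable per group
        A = np.zeros((len(means) + 1, len(ids)))
        b = np.zeros(len(means) + 1)
        for i, (m, s) in enumerate(zip(means, sigmas)):
            A[i, col[group[i + 1]]] += 1.0 / s   # row i: (x_{i+1} - x_i - m_i)/s
            A[i, col[group[i]]] -= 1.0 / s
            b[i] = m / s
        A[-1, col[group[0]]] = 1.0               # gauge fix: anchor the first wall
        b[-1] = offsets[0]
        x, *_ = np.linalg.lstsq(A, b, rcond=None)
        return [x[col[g]] for g in group]        # adjusted offset per observation
blank |
text | For the example of Figure 2.7, adjust_offsets([a, b, c], [b - a, c - b], [s1, s2], [(0, 2)]) returns offsets whose first and last entries agree, i.e., c = a.
blank |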
blank |
text | Figure 2.8: Selection. In sequence (a), the user is observing two new planes in the scene (colored white) and one currently included plane (colored blue). The user selects one of the new planes by pointing at it and clicking. Then, the second new plane is added. All planes are blue in the final frame, confirming that all planes have been successfully selected. Sequence (b) shows a configuration where the user has decided not to include the large cabinet. Sequence (c) shows successful selection of the ceiling and the wall despite clutter.
blank |
title | 2.3.4 Map Update
text | Our algorithm ignores most irrelevant features by using the Manhattan-world assumption. However, the assumption alone cannot distinguish architectural components from other axis-aligned objects. For example, furniture, open doors, parts of other rooms that might be visible, or reflections from mirrors may be detected as axis-aligned planes. The system handles these challenging cases by allowing the user to manually specify the planes that he or she would like to include in the final model. This manual specification consists of simply clicking the input button during scanning while pointing at a plane, as shown in Figure 2.8. If the user enters a new room, a right click of the button indicates that the user wishes to include this new room and to optimize it individually. The system creates a new loop of planes, and any newly observed planes are added to the loop.
| Whenever a new plane is added to L_r^t or there is user input to specify the room structure, the map update routine extracts a 2-D rectilinear polygon R_r^t from L_r^t with the help of user input.
blank |
text | [Figure 2.9 pie chart; unit: ms. Steps: data i/o, image pre-processing, optical flow, pair-wise registration (58.672 ms, 51%), plane extraction, data association, refine registration, optimize map.]
blank |
text | Figure 2.9: The average computational time for each step of the system.
blank |
text | A valid rectilinear polygon structure should have alternating axis directions for any pair of adjacent walls (an x = x_i wall should be connected to a y = y_j wall). The system starts by adding all selected planes into R_r^t, as well as whichever unselected planes in L_r^t are necessary to maintain alternating axis directions. When planes are added, planes with observed boundary edges are preferred. If two observed walls have the same axis direction, an unobserved wall is added between them on the boundary of the planes to form a complete loop.
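text | A sketch of the alternating-axis repair pass follows; it illustrates only the insertion rule for adjacent same-axis walls, ignores the wrap-around at the end of the loop, and leaves the connector's offset to be derived from the observed plane boundaries:
blank |
    def complete_loop(walls):
        # walls: ordered (axis, offset) pairs around a room, axis in {'x', 'y'}.
        # A valid rectilinear loop must alternate between x- and y-aligned walls.
        out = []
        for wall in walls:
            if out and out[-1][0] == wall[0]:
                # Adjacent walls share an axis: insert an unobserved connector
                # wall on the other axis; its offset comes from plane boundaries.
                connector_axis = 'y' if wall[0] == 'x' else 'x'
                out.append((connector_axis, None))
            out.append(wall)
        return out
blank |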
blank |
title | 2.4 Evaluation
text | The goal of the system is to build a floor plan of any interior environment. In our testing of the system, we mapped the apartments of six different volunteers, ranging from approximately 500 to 2000 ft^2, located in Palo Alto. The residents were living in the scanned places, and thus the apartments exhibited different amounts and types of objects.
| For each data set, we compare the floor plan generated by our system with one generated manually using measurements from a commercially available measuring device.[1] The current practice in architecture and real estate is to use a point-to-point laser device to measure distances between pairs of parallel planes.
| Making such measurements requires a clear, level line of sight between two planes, which may be time-consuming to find due to the presence of furniture, windows, and other obstructions. Moreover, after making all the distance measurements, a user is required to manually draw a floor plan that respects the measurements. In our tests, roughly 10-20 minutes were needed to build a floor plan of each apartment in the conventional way described above.
meta | [1] Measuring range 0.05 to 40 m; average measurement accuracy +/- 1.5 mm; measurement duration < 0.5 s to 4 s per measurement.
| Using our system, the data acquisition process took approximately 2-5 minutes per apartment to initiate, run, and generate the full floor plan. Table 2.1 summarizes the timing data for each data set. The average frame rate is 7.5 frames per second running on an Intel 2.50 GHz Dual Core laptop. Figure 2.9 depicts the average computational time for each step of the algorithm. The pair-wise registration routine (Sec. 2.3.1) contributes more than half of the computational time, followed by the pre-processing step of fetching a new frame and calculating optical flow (25%).
| In Figure 2.10, we visually compare the floor plans reconstructed in the conventional way with those built by our system. The floor plans in blue were reconstructed using point-to-point laser measurements, and the floor plans in red were reconstructed by our system. For each apartment, the topology of the reconstructed walls agrees with the manually constructed floor plan. In all cases, the detection and labeling of planar surfaces by our algorithm enabled the user to add or remove these surfaces from the model in real time, allowing the final model to be constructed using only the important architectural elements from the scene.
| The overlaid floor plans in Figure 2.10(c) show that the relative placement of the rooms may be misaligned. This is because our global adjustment routine optimizes rooms individually, so errors can accumulate in the transitions between rooms. The algorithm could be extended to enforce global constraints on the relative placement of rooms, such as maintaining a certain wall thickness and/or aligning the outermost walls, but such global constraints may induce other errors.
| Table 2.1 contains a quantitative comparison of the errors. The reported depth resolution of the sensor is 0.01 m at 2 m, and for each model we have an average error of 0.075 m per wall. The relative error stays in the range of 2-5%, which shows that small registration errors continue to accumulate as more frames are processed.
blank |
text | data set | no. of frames | run time | fps  | avg. error (m) | avg. error (%)
| 1        | 1465          | 2m 56s   | 8.32 | 0.115          | 4.14
| 2        | 1009          | 1m 57s   | 8.66 | 0.064          | 1.90
| 3        | 2830          | 5m 19s   | 8.88 | 0.053          | 2.40
| 4        | 1129          | 2m 39s   | 7.08 | 0.088          | 2.34
| 5        | 1533          | 3m 52s   | 6.59 | 0.178          | 3.52
| 6        | 2811          | 7m 4s    | 6.65 | 0.096          | 3.10
| ave.     | 1795          | 3m 57s   | 7.54 | 0.075          | 2.86
blank |
text | Table 2.1: Accuracy comparison between floor plans reconstructed by our system and manually constructed floor plans generated from point-to-point laser measurements.
blank |
text | Fundamentally, the limitations of our method reflect the limitations of the Kinect sensor, the processing power of the laptop, and the assumptions made in our approach. Because the accuracy of real-time depth data is worse than that of visual features, our approach exhibits larger errors compared to visual SLAM (e.g., [ND10]). Some of the uncertainty could be reduced by adapting approaches from the well-explored visual SLAM literature. Still, the system is limited when meaningful features cannot be detected. The Kinect sensor's reported measurement range is between 1.2 and 3.5 m from an object; outside that range, data is noisy or unavailable. As a consequence, data in narrow hallways or large atriums was difficult to collect. Another source of potential error is a user outpacing the operating rate of approximately 7.5 fps. This frame rate already allows for a reasonable data capture pace, but with more processing power, the pace of the system could be guaranteed to exceed normal human motion.
blank |
title | 2.5 Conclusions and Future Work
text | We have presented an interactive system that allows a user to capture accurate architectural information and to automatically generate a floor plan. Leveraging the Manhattan-world assumption, we have created a representation that is tractable in real time while ignoring clutter. In the presented system, the current status of the reconstruction is projected onto the scanned environment to enable the user to provide high-level feedback to the system.
| This feedback helps overcome ambiguous situations and allows the user to interactively specify the important planes that should be included in the model.
| If there are not enough features scanned for the system to determine that the operator has moved, the system will assume that motion has not occurred, leading to a general underestimation of wall lengths when no depth or image features are available. These challenges could be overcome by including an IMU or other devices to assist in the pose tracking of the system.
| We have limited our Manhattan-world features to axis-aligned planes in vertical directions. However, in future work, we could generalize the system to handle rectilinear polyhedra that are not convex in the vertical direction. Furthermore, the world could be expanded to include walls that are not aligned with the axes of the global coordinate system.
| More broadly, our interactive system could be extended to other applications in indoor environments. For example, a user could visualize modifications to the space, as shown in Figure 2.11, where a user clicks and drags a cursor across a plane to "add" a window. This example illustrates the range of possible uses of our system.
blank |
text | [Figure 2.10 layout: rows house 1 through house 6; columns (a), (b), (c).]
blank |
text | Figure 2.10: (a) Manually constructed floor plans generated from point-to-point laser measurements, (b) floor plans acquired with our system, and (c) overlay. For house 4, some parts (pillars in a large open space, stairs, and an elevator) are ignored by the user. The system still uses the measurements from those parts and other objects to correctly understand the relative positions of the rooms.
blank |
text | Figure 2.11: The system, having detected the planes in the scene, also allows the user to interact directly with the physical world. Here the user adds a window to the room by dragging a cursor across the wall (left). This motion updates the internal model of the world (right).
blank |
meta | Chapter 3
blank |
title | Acquiring 3D Indoor Environments with Variability and Repetition[2]
blank |
text | Unlike the mapping of urban environments, interior mapping focuses on interior objects, which can be geometrically complex, located in cluttered settings, and subject to significant variations. In addition, the indoor 3-D data captured from RGB-D cameras suffer from limited resolution and data quality. The process is further complicated when the model deforms between successive acquisitions. The work described in this chapter focused on acquiring and understanding objects in the interiors of public buildings (e.g., schools, hospitals, hotels, restaurants, airports, train stations) or office buildings from RGB-D camera scans of such interiors.
| We exploited three observations to make the problem of indoor 3D acquisition tractable: (i) most such building interiors are composed of basic elements such as walls, doors, windows, and furniture (e.g., chairs, tables, lamps, computers, cabinets), which come from a small number of prototypes and repeat many times; (ii) such building components usually consist of rigid parts of simple geometry, i.e., they have surfaces that are well approximated by planar, cylindrical, conical, or spherical proxies.
text | [2] The contents of this chapter were originally published as Young Min Kim, Niloy J. Mitra, Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring 3D indoor environments with variability and repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages. DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157.
blank |
| CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 34
|
1449
|
+
blank |
|
1450
|
+
|
|
1451
|
+
|
|
1452
|
+
|
|
1453
|
+
text | office scene
|
1454
|
+
blank |
|
1455
|
+
|
|
1456
|
+
|
|
1457
|
+
|
|
1458
|
+
text | input single-view scan recognized objects retrieved and posed models
|
1459
|
+
blank |
|
1460
|
+
|
|
1461
|
+
text | Figure 3.1: (Left) Given a single-view scan of a 3D environment obtained using a fast range scanner, the system performs scene understanding by recognizing repeated objects, while factoring out their modes of variability (middle). The repeating objects have been learned beforehand as low-complexity models, along with their joint deformations. The system extracts the objects despite a poor-quality input scan with large missing parts and many outliers. The extracted parameters can then be used to pose 3D models to create a plausible scene reconstruction (right).
blank |
text | Further, although variability and articulation are dominant (e.g., a chair is moved or rotated, a lamp arm is bent and adjusted), such variability is limited and low-dimensional (e.g., translational motion, hinge joint, telescopic joint); and (iii) mutual relationships among the basic objects satisfy strong priors (e.g., a chair stands on the floor, a monitor rests on the table).
| We present a simple yet practical system to acquire models of indoor objects such as furniture, together with their variability modes, and to discover object repetitions and exploit them to speed up large-scale indoor acquisition towards high-level scene understanding. Our algorithm works in two phases. First, in the learning phase, the system starts from a few scans of individual objects to construct primitive-based 3D models while explicitly recovering the respective joint attributes and modes of variation. Second, in the fast recognition phase (about 200 ms/model), the system starts from a single-view scan to segment and classify it into plausible objects, recognize them, and extract the pose parameters for the low-complexity models generated in the learning phase. Intuitively, our system uses priors for primitive types and their connections, thus greatly reducing the number of unknowns to enable model fitting even from very sparse and low-resolution datasets, while hierarchically associating subsets of scans to parts of objects. We also demonstrate that simple inter- and intra-object relations simplify the segmentation and classification tasks necessary for high-level scene understanding (see [MPWC12] and references therein).
| We tested our method on a range of challenging synthetic and real-world scenes. We present, for the first time, basic scene reconstruction for massive indoor scenes (e.g., office spaces and building auditoriums on a university campus) from unreliable sparse data by exploiting the low-complexity variability of common scene objects. We show how we can now detect meaningful changes in an environment. For example, our system was able to discover a new object placed in an office space by rescanning the scene, despite articulations and motions of the previously extant objects (e.g., desks, chairs, monitors, lamps). Thus, the system factors out nuisance modes of variability (e.g., motions of the chairs) from variability modes that have importance in an application (e.g., security, where new scene objects should be flagged).
blank |
title | 3.1 Related Work
blank |
title | 3.1.1 Scanning Technology
text | Rusinkiewicz et al. [RHHL02] demonstrated the possibility of real-time lightweight 3D scanning. More generally, surface reconstruction from unorganized pointcloud data has been extensively studied in computer graphics, computational geometry, and computer vision (see [Dey07]). Further, powered by recent developments in real-time range scanning, everyday users can now easily acquire 3D data at high frame-rates. Researchers have proposed algorithms to accumulate multiple poor-quality individual frames to obtain better-quality pointclouds [MFO+07, HKH+12, IKH+11]. Our main goal differed, however, because our system focused on recognizing important elements and semantically understanding large 3D indoor environments.
blank |
title | 3.1.2 Geometric Priors for Objects
text | Our system utilizes geometry at the level of individual objects, which are possible abstractions used by humans to understand the environment [MZL+09]. Similar to Xu et al. [XLZ+10], we understand an object as a collection of primitive parts and segment the object based on the prior. Such a prior can successfully fill regions of missing parts [PMG+05], infer plausible part motions of mechanical assemblies [MYY+10], extract shape by deforming a template model to match silhouette images [XZZ+11], locate an object from photographs [XS12], or semantically edit images based on simple scene proxies [ZCC+12].
| The system focuses on locating 3D deformable objects in unsegmented, noisy, single-view data in a cluttered environment. Researchers have used non-rigid alignment to better align (warped) multiple scans [LAGP09]. Alternately, temporal information across multiple frames can be used to track and recover a deformable model with joints between rigid parts [CZ11]. Instead, our system learns an instance-specific geometric prior as a collection of simple primitives, along with deformation modes, from a very small number of scans. Note that the priors are extracted in the learning stage rather than being hard-coded in the framework. We demonstrate that such models are sufficiently representative to extract the essence of real-world indoor scenes (see also concurrent efforts by Nan et al. [NXS12] and Shao et al. [SXZ+12]).
blank |
title | 3.1.3 Scene Understanding
text | In the context of image understanding, Lee et al. [LGHK10] constructed a box-based reconstruction of indoor scenes using volumetric considerations, while Gupta et al. [GEH10] applied geometric constraints and physical considerations to obtain a block-based 3D scene model. In the context of range scans, there have been only a few efforts: Triebel et al. [TSS10] presented an unsupervised algorithm to detect repeating parts by clustering on pre-segmented input data, while Koppula et al. [KAJS11] used a graphical model to learn features and contextual relations across objects. Earlier, Schnabel et al. [SWWK08] detected features in large point clouds using constrained graphs that describe configurations of basic shapes (e.g., planes, cylinders, etc.) and then performed graph matching, which cannot be directly used in large, cluttered environments captured at low resolutions.
| Various learning-based approaches have recently been proposed to analyze and segment 3D geometry, especially towards consistent segmentation and part-label association [HKG11, SvKK+11]. While similar MRF or CRF optimization could be applied in our setting, we found that a fully geometric algorithm can produce comparable
| high-quality recognition results without extensive training. In our setting, learning amounts to recovering the appropriate deformation model for the scanned object in terms of the arrangement of primitives and their connection types. While most machine-learning approaches are restricted to local features and limited viewpoints, our geometric approach successfully handles the variability of objects and utilizes the extracted high-level information.
blank |
text | [Figure 3.2 diagram. Learning phase: object scans I_11, I_12, I_13, ... yield model M_1; I_21, I_22, I_23, ... yield M_2. Recognition phase: scene scan S yields objects o_1, o_2, ...]
blank |
text | Figure 3.2: Our algorithm consists of two main phases: (i) a relatively slow learning phase to acquire object models as collections of interconnected primitives and their joint properties, and (ii) a fast object recognition phase that takes an average of 200 ms/model.
blank |
title | 3.2 Overview
text | Our framework works in two main phases: a learning phase and a recognition phase (see Figure 3.2).
| In the learning phase, our system scans each object of interest a few times (typically 5-10 scans across different poses). The goal is to consistently segment the scans
| into parts as well as to identify the junctions between part-pairs to recover the respective junction attributes. Such a goal, however, is challenging given the input quality. We address the problem using two scene characteristics: (i) many man-made objects are well approximated by a collection of simple primitives (e.g., planes, boxes, cylinders), and (ii) the types of junctions between such primitives are limited (e.g., hinge, translational) and of low complexity. First, our system recovers a set of stable primitives for each individual scan. Then, for each object, the system collectively processes the scans to extract a primitive-based proxy representation along with the necessary inter-part junction attributes to build a collection of models {M_1, M_2, ...}.
| In the recognition phase, the system starts with a single scan S of the scene. First, the system extracts the dominant planes in the scene; typically they capture the ground, walls, desks, etc. The system identifies the ground plane by using the (approximate) up-vector from the acquisition device and noting that the points lie above the ground. Planes parallel to the ground are tagged as tabletops if they are at heights observed in the training phase (typically 1'-3'), exploiting the fact that working surfaces have similar heights across rooms. The system removes the points associated with the ground plane and the candidate tabletops, and performs connected component analysis on the remaining points (on a k_n-nearest-neighbor graph) to extract pointsets {o_1, o_2, ...}.
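text | A sketch of this connected-component step over a k-nearest-neighbor graph, using scipy, is given below; the neighbor count is an assumed parameter, and the ground/tabletop points are presumed to be already removed:
blank |
    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import connected_components

    def extract_pointsets(points, k=10):
        # points: (N, 3) array with ground/tabletop points already removed.
        # Build a k-nearest-neighbor graph, then split it into components.
        tree = cKDTree(points)
        _, nbrs = tree.query(points, k=k + 1)   # first neighbor is the point itself
        rows = np.repeat(np.arange(len(points)), k)
        cols = nbrs[:, 1:].ravel()
        graph = coo_matrix((np.ones(len(rows)), (rows, cols)),
                           shape=(len(points), len(points)))
        n, labels = connected_components(graph, directed=False)
        return [points[labels == c] for c in range(n)]  # pointsets o_1, o_2, ...
blank |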
| The system then tests whether each pointset o_i can be satisfactorily explained by any of the object models M_j. Note, however, that this step is difficult since the data is unreliable and the objects can have large geometric variations due to changes in their position and pose. The system performs hierarchical matching, which uses the learned geometry while trying to match individual parts first, and exploits simple scene priors like (i) placement relations (e.g., monitors are placed on desks, chairs rest on the ground) and (ii) allowable repetition modes (e.g., monitors usually repeat horizontally, chairs are repeated on the ground). We assume such priors are available as domain knowledge (e.g., Fisher et al. [FSH11]).
blank |
|
|
1633
|
+
text | points super-points parts objects
|
1634
|
+
| I X = {x1 , x2 ,... } P = { p1 , p2 ,... } O = {o1 , o2 ,... }
|
1635
|
+
blank |
|
1636
|
+
|
|
1637
|
+
text | Figure 3.3: Unstructured input point cloud is processed into hierarchical data struc-
|
1638
|
+
| ture composed of super-points, parts, and objects.
|
1639
|
+
blank |
|
1640
|
+
title | 3.2.1 Models
|
1641
|
+
text | Our system represents the objects of interest as models that approximate the object
|
1642
|
+
| shapes while encoding deformation and relationship information (see also [OLGM11]).
|
1643
|
+
| Each model can be thought of as a graph structure, the nodes of which denote the
|
1644
|
+
| primitives and the edges of which encode the nodes’ connectivity and relationship
|
1645
|
+
| to the environment. Currently, the primitive types are limited to box, cylinder, and
|
1646
|
+
| radial structure. A box is used to represent a large flat structure; a cylinder is used to
|
1647
|
+
| represent a long and narrow structure; and a radial structure is used to capture parts
|
1648
|
+
| with discrete rotational symmetry (e.g., the base of a swivel chair). As an additional
|
1649
|
+
| regularization, the system groups parallel cylinders of similar lengths (e.g., legs of
|
1650
|
+
| a desk or arms of a chair), which in turn provides valuable cues for possible mirror
|
1651
|
+
| symmetries.
|
1652
|
+
| The connectivity between a pair of primitives is represented as their transfor-
|
1653
|
+
| mation relative to each other and their possible deformations. Our current imple-
|
1654
|
+
| mentation restricts deformations to be 1-DOF translation, 1-DOF rotation, and an
|
1655
|
+
| attachment. The system tests for translational joints for the cylinders and rotational
|
1656
|
+
| joints for cylinders or boxes (e.g., a hinge joint). An attachment represents the ex-
|
1657
|
+
| istence of a whole primitive node and is especially useful when, depending on the
|
1658
|
+
| configuration, the segmentation of the primitive is ambiguous. For example, the ge-
|
1659
|
+
| ometry of doors or drawers of cabinets is not easily segmented when they are closed,
|
1660
|
+
| and thus they are handled as an attachment when opened.
|
1661
|
+
| Additionally, the system detects contact information for the model, i.e., whether
|
1662
|
+
meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 40
|
1663
|
+
blank |
|
1664
|
+
|
|
1665
|
+
|
|
1666
|
+
text | the object rests on the ground or on a desk. Note that the system assumes that the
|
1667
|
+
| vertical direction is known for the scene. Both the direction of the model and the
|
1668
|
+
| direction of the ground define a canonical object transformation.
|
1669
|
+
blank |
|
1670
|
+
|
|
1671
|
+
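To make the graph representation concrete, here is a minimal Python sketch of the node and edge types described above; all names (Primitive, Joint, Model) are hypothetical illustrations, not the system's actual data structures.

    # Hypothetical sketch of the model graph described above (not the system's code).
    from dataclasses import dataclass, field
    from enum import Enum

    class PrimitiveType(Enum):
        BOX = "box"              # large flat structures
        CYLINDER = "cylinder"    # long, narrow structures
        RADIAL = "radial"        # discrete rotational symmetry (e.g., a swivel-chair base)

    class JointType(Enum):
        TRANSLATIONAL = "translational"  # 1-DOF translation
        ROTATIONAL = "rotational"        # 1-DOF rotation (e.g., a hinge)
        ATTACHMENT = "attachment"        # whole primitive may be present or absent

    @dataclass
    class Primitive:
        kind: PrimitiveType
        dimensions: tuple                       # e.g., box extents, or (radius, length)
        visible_faces: frozenset = frozenset()  # which of a box's six faces were observed

    @dataclass
    class Joint:
        parent: int      # index into Model.nodes
        child: int
        kind: JointType

    @dataclass
    class Model:
        nodes: list = field(default_factory=list)   # Primitive nodes
        edges: list = field(default_factory=list)   # Joint edges: connectivity and deformation
        contact: str = "ground"                     # "ground" or "desk"

Grouped parallel cylinders and mirror symmetries would be additional annotations on top of this skeleton.
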
3.2.2 Hierarchical Structure

For both the learning and recognition phases, the raw input is unstructured point clouds. The input is hierarchically organized by considering neighboring points and assigning contextual information to each hierarchy level. The scene hierarchy has three levels of segmentation (see Figure 3.3):

• super-points X = {x_1, x_2, . . .};
• parts P = {p_1, p_2, . . .} (association X_p = {x : P(x) = p}); and
• objects O = {o_1, o_2, . . .} (association P_o = {p : O(p) = o}).

Instead of working directly on individual points, our system uses super-points x ∈ X as the atomic entities (analogous to super-pixels in images). The system creates super-points by uniformly sampling points from the raw measurements and associating local neighborhoods with the samples based on the normal consistency of points. Such super-points, or groups of points within a small neighborhood, are less noisy, while at the same time they are sufficiently small to capture the input distribution of points.

Next, our system aggregates neighboring super-points into primitive parts p ∈ P. Such parts are expected to relate to individual primitives of models. Each part p comprises a set of super-points X_p. The system initially finds such parts by merging neighboring super-points until the region can no longer be approximated by a plane (in a least-squares sense) with average error less than a threshold θ_dist. Note that the initial association of super-points with parts can change later. A sketch of this construction follows.

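This is a minimal sketch of the two steps, assuming numpy and scipy are available; the function names and the fixed neighborhood size are illustrative choices, not the thesis implementation.

    # Hypothetical sketch of super-point and part construction (assumes numpy/scipy).
    import numpy as np
    from scipy.spatial import cKDTree

    def build_superpoints(points, normals, fs=1/100, angle_thresh_deg=20.0):
        """Uniformly sample seeds and attach normal-consistent neighbors to each."""
        n = len(points)
        seeds = np.random.choice(n, size=max(1, int(n * fs)), replace=False)
        tree = cKDTree(points)
        cos_t = np.cos(np.radians(angle_thresh_deg))
        superpoints = []
        for s in seeds:
            _, idx = tree.query(points[s], k=50)        # local neighborhood
            ok = (normals[idx] @ normals[s]) > cos_t    # keep normal-consistent points
            superpoints.append(idx[ok])
        return superpoints

    def grow_part(superpoints, points, theta_dist=0.1):
        """Merge neighboring super-points while a plane still fits with low error."""
        members, pts = [superpoints[0]], points[superpoints[0]]
        for sp in superpoints[1:]:
            cand = np.vstack([pts, points[sp]])
            c = cand.mean(axis=0)
            # smallest singular vector = plane normal; residual = avg point-plane distance
            *_, vt = np.linalg.svd(cand - c)
            err = np.abs((cand - c) @ vt[-1]).mean()
            if err > theta_dist:
                break
            members.append(sp)
            pts = cand
        return members
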
Objects form the final hierarchy level during the recognition phase for scenes containing multiple objects. Objects, having been segmented, are mapped to individual instances of models, while the association between objects and parts (O(p) ∈ {1, 2, . . . , N_o} and P_o) is discovered during the recognition process. Note that during the learning phase the system deals with only one object at a time and hence such segmentation is trivial.

The system creates such a hierarchy in the pre-processing stage using the following parameters in all our tests: the number of nearest neighbors k_n used for normal estimation, the sampling rate f_s for super-points, and the distance threshold θ_dist, which reflects the approximate noise level. Table 3.1 shows the actual values.

param.     value      usage
k_n        50         number of nearest neighbors
f_s        1/100      sampling rate
θ_dist     0.1 m      distance threshold for segmentation
Ñ_p        10-20      Equation 3.1
θ_height   0.5        Equation 3.5
θ_normal   20°        Equation 3.6
θ_size     2θ_dist    Equation 3.7
λ          0.8        coverage ratio to declare a match

Table 3.1: Parameters used in our algorithm.

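For reference, the same values can be kept in a single configuration object; the sketch below is a hypothetical Python container mirroring Table 3.1, not code from the system.

    # Hypothetical container for the parameters of Table 3.1.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Params:
        k_n: int = 50                  # nearest neighbors for normal estimation
        f_s: float = 1 / 100           # super-point sampling rate
        theta_dist: float = 0.1        # meters; segmentation distance threshold
        N_p_min: int = 10              # approximate primitive count, 10-20 (Eq. 3.1)
        N_p_max: int = 20
        theta_height: float = 0.5      # histogram-intersection threshold (Eq. 3.5)
        theta_normal: float = 20.0     # degrees (Eq. 3.6)
        theta_size_factor: float = 2.0 # theta_size = 2 * theta_dist (Eq. 3.7)
        lam: float = 0.8               # coverage ratio to declare a match

    PARAMS = Params()
    theta_size = PARAMS.theta_size_factor * PARAMS.theta_dist  # 0.2 m
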
3.3 Learning Phase

The input to the learning phase is a set of point clouds {I^1, . . . , I^n} obtained from the same object in different configurations. Our goal is to build a model M consisting of primitives that are linked by joints. Essentially, the system has to simultaneously segment the scans into an unknown number of parts, establish correspondences across different measurements, and extract relative deformations. We simplify the problem by assuming that each part can be represented by primitives and that each joint can be encoded with a simple degree of freedom (see also [CZ11]). This assumption allows us to approximate many man-made objects, while at the same time it leads to a lightweight model. Note that, unlike Schnabel et al. [SWWK08], who use patches of partial primitives, our system uses full primitives to represent parts in the learning phase.

[Figure 3.4 near here. Top box, "Initialize the skeleton (Sec. 3.3.1)": mark stable parts/part-groups → match marked parts → jointly fit primitives to matched parts → update parts. Bottom box, "Incrementally complete the coherent model (Sec. 3.3.2)": match parts by relative position → jointly fit primitives to matched parts → update parts.]

Figure 3.4: The learning phase starts by initializing the skeleton model, which is defined from coherent matches of stable parts. After initialization, new primitives are added by finding groups of parts at similar relative locations, and then the primitives are jointly fitted.

The learning phase starts by detecting large and stable parts to establish a global reference frame across different measurements I^i (Section 3.3.1). The initial correspondences serve as a skeleton of the model, while other parts are incrementally added to the model until all of the points are covered within threshold θ_dist (Section 3.3.2). While primitive fitting is unstable over isolated noisy scans, our system jointly refines the primitives to construct a coherent model M (see Figure 3.4).

The final model also contains attributes necessary for robust matching. For example, the distribution of height from the ground plane provides a prior for tables; objects can have a preferred repetition direction, e.g., monitors or auditorium chairs are typically repeated sidewise; or objects can have preferred orientations. These learned attributes and relationships act as reliable regularizers in the recognition phase, when data is typically sparse, incomplete, and noisy.

3.3.1 Initializing the Skeleton of the Model

The initial structure is derived from large, stable parts across different measurements, whose consistent correspondences define the reference frame that aligns the measurements. In the pre-processing stage, individual scans I^i are divided into super-points X^i and parts P^i, as described in Section 3.2.2. The system then marks the stable parts as candidate boxes or candidate cylinders.

A candidate face of a box is marked by finding parts with a sufficient number of super-points:

|X_p| > |P| / Ñ_p,   (3.1)

where Ñ_p is a user-defined parameter for the approximate number of primitives in the model. In our tests, a threshold of 10-20 is used. Parallel planes with comparable heights are grouped together based on their orientation to constitute the opposite faces of a box primitive.

The system classifies a part as a candidate cylinder if the ratio of its top two principal components is greater than 2. Subsequently, parallel cylinders with similar heights (e.g., legs of chairs) are grouped.

After candidate boxes and cylinders are marked, the system matches the marked (sometimes grouped) parts for pairs of measurements P^i. The system only uses the consistent matches to define a reference frame between measurements and jointly fit primitives to the matched parts (see Section 3.3.2).

Matching

After extracting the stable parts P^i for each measurement, our goal is to match the parts across different measurements to build a connectivity structure. The system picks a seed measurement j ∈ {1, 2, . . . , n} at random and compares every other measurement against the seed measurement.

Our system then uses spectral correspondences [LH05] to match parts {p, q} ∈ P^j in the seed and {p′, q′} ∈ P^i in the other measurements. The system builds an affinity matrix A, where each entry represents the matching score between part pairs. Recall that candidate parts p have associated types (box or cylinder), say t(p). Intuitively, the system assigns a higher matching score to parts of the same type t(p) at similar relative positions. If a candidate assignment a = (p, p′) assigns p ∈ P^j to p′ ∈ P^i, the corresponding entries are defined as follows:

A(a, a) = \begin{cases} 0 & \text{if } t(p) \neq t(p') \\ \exp\left(-(h_p - h_{p'})^2 / 2\theta_{dist}^2\right) & \text{otherwise,} \end{cases} \qquad (3.2)

where our system uses the height from the ground h_p as a feature. The affinity value for a pair-wise assignment between a = (p, p′) and b = (q, q′) (p, q ∈ P^j and p′, q′ ∈ P^i) is defined as:

A(a, b) = \begin{cases} 0 & \text{if } t(p) \neq t(p') \text{ or } t(q) \neq t(q') \\ \exp\left(-(d(p, q) - d(p', q'))^2 / 2\theta_{dist}^2\right) & \text{otherwise,} \end{cases} \qquad (3.3)

where d(p, q) represents the distance between two parts p, q ∈ P. The system extracts the most dominant eigenvector of A to establish a correspondence among the candidate parts.

After comparing the seed measurement P^j against all the other measurements P^i, the system retains only those matches that are consistent across different measurements. The relative positions of the matched parts define the reference frame of the object as well as the relative transformation between measurements. A sketch of this step follows.

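This is a compact sketch of the spectral step, assuming numpy; the data layout (dicts keyed by part ids) and the greedy discretization are illustrative assumptions in the spirit of [LH05], not the system's exact procedure.

    # Hypothetical sketch of spectral part matching (Eqs. 3.2-3.3; assumes numpy).
    import numpy as np

    def spectral_match(cands, types, heights, dists, theta_dist=0.1):
        """cands: list of candidate assignments a = (p, p'); types/heights are dicts
        keyed by part id; dists[(p, q)] is the distance between two parts of the
        same measurement (defined for parts of both P^j and P^i)."""
        n = len(cands)
        A = np.zeros((n, n))
        for i, (p, p2) in enumerate(cands):
            if types[p] == types[p2]:                                       # Eq. 3.2
                A[i, i] = np.exp(-(heights[p] - heights[p2]) ** 2
                                 / (2 * theta_dist ** 2))
            for j, (q, q2) in enumerate(cands):
                if i == j or p == q or p2 == q2:
                    continue
                if types[p] == types[p2] and types[q] == types[q2]:         # Eq. 3.3
                    d = dists[(p, q)] - dists[(p2, q2)]
                    A[i, j] = np.exp(-d ** 2 / (2 * theta_dist ** 2))
        w, v = np.linalg.eigh(A)                  # A is symmetric
        lead = np.abs(v[:, np.argmax(w)])         # dominant eigenvector
        # Greedy discretization: accept assignments by decreasing eigenvector weight,
        # enforcing one-to-one matching between the two measurements.
        order, used_p, used_p2, matches = np.argsort(-lead), set(), set(), []
        for i in order:
            p, p2 = cands[i]
            if lead[i] > 0 and p not in used_p and p2 not in used_p2:
                matches.append((p, p2))
                used_p.add(p)
                used_p2.add(p2)
        return matches
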
Joint Primitive Fitting

Our system jointly fits primitives to the grouped parts, while adding the necessary deformations. First, the primitive type is fixed by testing for the three types of primitives (box, cylinder, and rotational structure) and picking the primitive with the smallest fitting error. Once the primitive type is fixed, the corresponding primitives from other measurements are averaged and added to the model as a jointly fitted primitive.

Our system uses the coordinate frame to position the fitted primitives. More specifically, the three orthogonal directions of a box are defined by the frame of reference given by the ground direction and the relative positions of the matched parts. If the normal of the largest observed face does not align with the default frame of reference, the box is rotated around an axis to align the large plane. The cylinder is aligned using its axis, while the rotational primitive is tested when the part is at the bottom of an object.

Note that unlike a cylinder or a rotational structure, a box can introduce new faces that are invisible because of the placement rules of objects. For example, the bottom of a chair seat or the back of a monitor is often missing in the input scans. Hence, the system retains the information about which of the six faces are visible to simplify the subsequent recognition phase.

Our system now encodes the inter-primitive connectivity as an edge of the graph structure. The joints between primitives are added by comparing the relationship between the parent and child primitives. The first matched primitive acts as the root of the model graph. Subsequent primitives are the children of the closest primitive among those already existing in the model. A translational joint is added if the size of the primitive node varies over different measurements by more than θ_dist; a rotational joint is added when the relative angle between the parent and child node differs by more than 20°. A sketch of this decision follows.

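As a sketch, the joint-type decision reduces to two range tests over the aligned measurements (hypothetical helper, assuming numpy; thresholds as stated above):

    # Hypothetical sketch of the joint-type test across measurements.
    import numpy as np

    def classify_joint(sizes, angles, theta_dist=0.1, theta_angle_deg=20.0):
        """sizes: per-measurement size of the child primitive along its joint axis;
        angles: per-measurement parent-child angle in degrees."""
        if np.ptp(sizes) > theta_dist:        # size varies across measurements
            return "translational"
        if np.ptp(angles) > theta_angle_deg:  # relative angle varies across measurements
            return "rotational"
        return "rigid"
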
3.3.2 Incrementally Completing a Coherent Model

Having built an initial model structure, the system incrementally adds primitives by processing super-points that could not be explained by the primitives. The remaining super-points are processed to create parts, and the parts are matched based on their relative positions. Starting from the bottom-most matches, the system jointly fits primitives to the matched parts, as described above. The system iterates the process until all super-points in the measurements are explained by the model.

If some parts exist in only a subset of measurements, the system adds them as an attachment of the primitive. For example, in Figure 3.5, after each side of the rectangular shape of a drawer has been matched, the open drawer is added as an attachment to the base shape.

The system also maintains the contact point of a model with the ground (or the bottom-most primitive), the height distribution of each part as a histogram, visible-face information, and the canonical frame of reference defined during the matching process. This information, along with the extracted models, is used during the recognition phase.

[Figure 3.5 near here. Labels: open drawers; unmatched parts.]

Figure 3.5: The open drawers remain unmatched (grey) after incremental matching and joint primitive fitting. These parts will be added as an attachment of the model.

3.4 Recognition Phase

Having learned a set of models (along with their deformation modes) M := {M_1, . . . , M_k} for a particular environment, the system can quickly collect and understand the environment in the recognition phase. This phase is much faster than the learning phase since there are only a small number of simple primitives and certain deformation modes from which to search. As input, the scene S containing the learned models is collected using the framework from Engelhard et al. [EEH+11], which takes a few seconds. In a pre-processing stage, the system marks the most dominant plane as the ground plane g. Then, the second most dominant plane that is parallel to the ground plane is marked as the desk plane d. The system processes the remaining points to form a hierarchical structure with super-points, parts, and objects (see Section 3.2.2).

The recognition phase starts from a part-based assignment, which quickly compares parts in the measurement and primitive nodes in each model. The algorithm infers the deformation and transformation of the model from the matched parts, while filtering the valid match by comparing the actual measurement against the underlying geometry. If a sufficient portion of the measurements can be explained by the model, the system accepts the match as valid, and the segmentation at both the object level and the part level is refined to match the model.

[Figure 3.6 near here. Top panel, "Initial assignments for parts (Sec. 3.4.1)": parts {p_1, p_2, . . .} ∈ o_i of the scene S are assigned to model nodes {m_1, m_2, m_3, l_1, α_3} ∈ M, with rotational and translational joints and a ground contact g. Bottom panel, "Refined assignment with geometry (Sec. 3.4.2)": iterate between solving for deformation given matches and finding correspondence and segmentation, e.g., p_1 = m_3, h(p_1) = h(m_3) = f^h(l_1, α_3), n(p_1) = n(m_3) = f^n(α_3).]

Figure 3.6: Overview of the recognition phase. The algorithm first finds matched parts before proceeding to recover the entire model and its corresponding segmentation.

3.4.1 Initial Assignment for Parts

Our system first makes coarse assignments between segmented parts and model nodes to quickly reduce the search space (see Figure 3.6, top). If a part and a primitive node form a potential match, the system also induces the relative transformation between them. The output of the algorithm is a list of triplets composed of a part, a node from the model, and a transformation, {(p, m, T)}.

Our system uses geometric features to decide whether individual parts can be matched with model nodes. Note that the system does not use color information in our setting. As features for individual parts A_p, our system considers the following: (i) the height distribution from the ground plane as a histogram vector h_p; (ii) the three principal components of the region x_p^1, x_p^2, x_p^3 (x_p^3 = n_p); and (iii) the sizes along these directions l_p^1 > l_p^2 > l_p^3.

Similarly, the system infers the counterpart features for the individual visible faces of model parts A_m. Thus, even if only one face of a part is visible from the measurement, our system is still able to detect the matched part of the model. The height histogram h_m is calculated from the relative area per height interval; the dimensions and principal components are inferred from the shape of the faces.

All the parts are compared against all the faces of primitive nodes in the model:

E(A_p, A_m) = \psi^{height}(h_p, h_m) \cdot \psi^{normal}(n_p, n_m; g) \cdot \psi^{size}(\{l_p^1, l_p^2\}, \{l_m^1, l_m^2\}). \qquad (3.4)

Each individual potential function ψ returns either 1 (matched) or 0 (not matched) depending on whether the feature satisfies its criterion within an allowable threshold. Parts are matched only if all the feature criteria are satisfied. The height potential calculates the histogram intersection

\psi^{height}(h_p, h_m) = \sum_i \min(h_p(i), h_m(i)) > \theta_{height}. \qquad (3.5)

The normal potential calculates the relative angle with the ground plane normal (n_g) as

\psi^{normal}(n_p, n_m; g) = |\arccos(n_p \cdot n_g) - \arccos(n_m \cdot n_g)| < \theta_{normal}. \qquad (3.6)

The size potential compares the sizes of the part:

\psi^{size}(\{l_p^1, l_p^2\}, \{l_m^1, l_m^2\}) = |l_p^1 - l_m^1| < \theta_{size} \ \text{and}\ |l_p^2 - l_m^2| < \theta_{size}. \qquad (3.7)

Our system sets the thresholds generously to allow false positives and retain multiple (or no) matched parts per object (see Table 3.1). In effect, the system first guesses potential object-model associations and later prunes out the incorrect associations in the refinement step using the full geometry (see Section 3.4.2). If Equation 3.4 returns 1, then the system can obtain a good estimate of the relative transformation T between the model and the part by using the position, normal, and ground plane direction to create a triplet (p, m, T). A sketch of this gating follows.

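This is a sketch of the gating of Equations 3.4-3.7, assuming numpy; the function name and data layout are hypothetical, with thresholds taken from Table 3.1.

    # Hypothetical sketch of the binary gating of Eqs. 3.4-3.7 (assumes numpy).
    import numpy as np

    def part_matches(h_p, h_m, n_p, n_m, l_p, l_m, n_g,
                     theta_height=0.5, theta_normal=20.0, theta_size=0.2):
        """h_*: height histograms; n_*: unit normals; l_*: two largest extents,
        sorted in decreasing order; n_g: ground plane normal."""
        psi_height = np.minimum(h_p, h_m).sum() > theta_height             # Eq. 3.5
        rel = abs(np.degrees(np.arccos(np.clip(n_p @ n_g, -1, 1)))
                  - np.degrees(np.arccos(np.clip(n_m @ n_g, -1, 1))))
        psi_normal = rel < theta_normal                                    # Eq. 3.6
        psi_size = (abs(l_p[0] - l_m[0]) < theta_size and
                    abs(l_p[1] - l_m[1]) < theta_size)                     # Eq. 3.7
        return psi_height and psi_normal and psi_size                      # Eq. 3.4
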
3.4.2 Refined Assignment with Geometry

Starting from the list of part, node, and transformation triplets {(p, m, T)}, the system verifies the assignments with a full model by comparing a segmented object o = O(p) against models M_i. The goal is to produce accurate part assignments for the observable parts, the transformation, and the deformation parameters. Intuitively, the system finds a local minimum from the suggested starting point (p, m, T) with the help of the models extracted in the learning phase. The system then optimizes by alternately refining the model pose and updating the segmentation (see Figure 3.6, bottom).

Given the assignment between p and m, the system first refines the registration and deformation parameters and places the model M to best explain the measurements. If the placed model covers most of the points that belong to the object (ratio λ = 0.8 in our tests) within the distance threshold θ_dist, then the system confirms that the model is matched to the object. Note that, compared to the generous threshold in part-matching in Section 3.4.1, the system now sets a conservative threshold to prune false positives.

In the case of a match, the geometry is fixed and the system refines the segmentation, i.e., the part and object boundaries are modified to match the underlying geometry. The process is iterated until convergence.

Refining Deformation and Registration

Our system finds the deformation parameters using the relative location and orientation of parts and the contact plane (e.g., the desk top or the ground plane). Given any pair of parts, or a part and the ground plane, their mutual distance and orientation are formulated as functions of the deformation parameters on the path between the two parts. For example, if our system starts from the matched part-primitive pair p_1 and m_3 in Figure 3.6, then the height and the normal of the part can be expressed as functions of the deformation parameters l_1 and α_3 of the model. The system solves a set of linear equations for the observed parts and the contact location to solve for the deformation parameters. Then, the registration between the scan and the deformed model is refined by Iterative Closest Point (ICP) [BM92].

Ideally, part p in the scene measurement should be explained by the assigned part geometry within the distance threshold θ_dist. The model is matched to the measurement if the proportion of points within θ_dist is more than λ; a sketch of this test follows. (Note that not all faces of the part need to be explained by the region measurement, as only a subset of the model is measured by the sensor.) Otherwise, the triplet (p, m, T) is an invalid assignment and the algorithm returns false. After initial matching (Section 3.4.1), multiple parts of an object can match to different primitives of many models. If there are multiple successful matches for an object, the system retains the assignment with the largest number of points.

[Figure 3.7 near here. Panels: input points; models matched; parts assigned; initial objects; refined objects.]

Figure 3.7: The initial object-level segmentation can be imperfect, especially between distant parts. For example, the top and base of a chair initially appeared to be separate objects, but were eventually understood as the same object after the segments were refined based on the geometry of the matched model.

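The acceptance test itself is a simple coverage ratio; the sketch below assumes scipy's KD-tree for nearest-surface distances, with the sampling of the model surface treated as an assumed pre-processing step.

    # Hypothetical sketch of the coverage test used to accept a match (assumes scipy).
    from scipy.spatial import cKDTree

    def is_valid_match(object_points, model_surface_samples, theta_dist=0.1, lam=0.8):
        """Accept the placed, deformed model if at least a fraction lam of the
        object's points lie within theta_dist of the model surface."""
        d, _ = cKDTree(model_surface_samples).query(object_points)
        return (d < theta_dist).mean() >= lam
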
Refine Segmentation

After a model is picked and positioned in the configuration, its location is fixed while the system refines the segmentation based on the underlying model. Recall that the initial segmentation merges super-points with similar normals into parts P and groups neighboring parts into objects O using the distance threshold. Although the initial segmentations provide a sufficient approximation to roughly locate the models, they do not necessarily coincide with the actual part and object boundaries without being compared against the geometry.

First, the system updates the association between super-points and parts by finding the closest primitive node of the model for each super-point, as sketched below. The super-points that belong to the same model node are grouped into the same part (see Figure 3.7). In contrast, super-points that are farther away than the distance threshold θ_dist from any of the primitives are separated to form a new segment with a null assignment.

After the part assignment, the system searches for the missing primitives by merging neighboring objects (see Figure 3.7). In the initial segmentation, objects that are close to each other in the scene can lead to multiple objects grouped into a single segment. Further, particular viewpoints of an object can cause parts within the model to appear farther apart, leading to spurious multiple segments. Hence, the super-points are assigned to an object only after the existence of the object is verified against the underlying geometry.

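This is a sketch of the reassignment, assuming each primitive node is represented by sampled surface points; that representation is an assumption for illustration, since a closed-form point-to-primitive distance would also work.

    # Hypothetical sketch of super-point reassignment (assumes numpy).
    import numpy as np

    def reassign_superpoints(sp_centers, node_surfaces, theta_dist=0.1):
        """node_surfaces: dict node_id -> sampled surface points of that primitive.
        Returns, per super-point, the closest node id, or None beyond theta_dist
        (a null assignment that starts a new segment)."""
        labels = []
        for c in sp_centers:
            best, best_d = None, theta_dist
            for node, surf in node_surfaces.items():
                d = np.linalg.norm(surf - c, axis=1).min()
                if d < best_d:
                    best, best_d = node, d
            labels.append(best)
        return labels
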
3.5 Results

In this section, we present the performance results obtained from testing our system on various synthetic and real-world scenes.

3.5.1 Synthetic Scenes

We tested our framework on synthetic scans of 3D scenes obtained from the Google 3D Warehouse (see Figure 3.8). We implemented a virtual scanner to generate the synthetic data: once the user specifies a viewpoint, we read the depth buffer to recover 3D range data of the virtual scene from the specified viewpoint. We control the scan quality using three parameters: (i) the scanning density d, which controls the fraction of points that are retained, (ii) the noise level g, which controls the zero-mean Gaussian noise added to each point along the current viewing direction, and (iii) the angle noise a, which perturbs the position in the local tangent plane using zero-mean Gaussian noise. Unless stated otherwise, we used default values of d = 0.4, g = 0.01, and a = 5°. A sketch of this degradation follows.

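This is a sketch of the three controls, assuming numpy; the mapping from the angle noise a to a metric tangent offset is an illustrative assumption, since the text does not spell out the units.

    # Hypothetical sketch of the scan-degradation controls d, g, a (assumes numpy).
    import numpy as np

    def degrade_scan(points, view_dirs, normals, d=0.4, g=0.01, a_deg=5.0, rng=None):
        rng = rng or np.random.default_rng()
        keep = rng.random(len(points)) < d                 # density: keep a fraction d
        p, v, n = points[keep], view_dirs[keep], normals[keep]
        p = p + v * rng.normal(0.0, g, size=(len(p), 1))   # depth noise along view ray
        # Tangent jitter: perturb within the local tangent plane; the offset scale
        # derived from a_deg is an assumption, not the thesis' exact formulation.
        t1 = np.cross(n, v)
        t1 /= np.linalg.norm(t1, axis=1, keepdims=True) + 1e-9
        t2 = np.cross(n, t1)
        sigma = np.tan(np.radians(a_deg))
        p = (p + t1 * rng.normal(0, sigma, (len(p), 1))
               + t2 * rng.normal(0, sigma, (len(p), 1)))
        return p
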
In Figure 3.8, we present typical recognition results using our framework. The system learned different models of chairs and placed them with varying deformations (see Table 3.2). We exaggerated some of the deformation modes, including very high chairs and severely tilted monitors, but could still reliably detect them all (see Table 3.3). Beyond recognition, our system reliably recovered both positions and pose parameters within a 5% error margin of the object size. Incomplete data can, however, result in ambiguities: for example, in synthetic #2 our system correctly detected a chair, but displayed it in a flipped position, since the scan contained data only from the chair's back. While specific volume-based reasoning can be used to give preference to chairs in an upright position, our system avoided such case-specific rules in the current implementation.

[Figure 3.8 near here. Rows: synthetic 1; synthetic 2; synthetic 3.]

Figure 3.8: Recognition results on synthetic scans of virtual scenes: (left to right) synthetic scenes, virtual scans, and detected scene objects with variations. Unmatched points are shown in gray.

[Figure 3.9 near here. Left: a similar pair; right: a different pair.]

Figure 3.9: Chair models used in synthetic scenes.

In practice, acquired data sets suffer from varying sampling resolution, noise, and occlusion. While it is difficult to exactly mimic real-world scenarios, we ran synthetic tests to assess the stability of our algorithm. We placed two classes of chairs (see Figure 3.9) on a ground plane, 70-80 chairs of each type, and created scans from 5 different viewpoints with varying density and noise parameters. For both classes, we used our recognition framework to measure precision and recall while varying the parameter λ, as sketched below. Note that precision represents how many of the detected objects are correctly classified out of the total number of detections, while recall represents how many objects were correctly detected out of the total number of placed objects. In other words, a precision of 1 indicates no false positives, while a recall of 1 indicates no false negatives.

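For reference, this is a minimal sketch of how such a precision-recall sweep over λ can be computed; the helper is hypothetical and, for simplicity, assumes one candidate detection per placed object.

    # Hypothetical sketch of a precision/recall sweep over the parameter lambda.
    def precision_recall(scores, labels, lam_values):
        """scores: per-detection coverage ratios; labels: True if the detection is
        correct. Assumes one candidate per placed object, so the ground-truth count
        equals the number of correct candidates."""
        n_placed = sum(labels)
        curve = []
        for lam in lam_values:
            det = [ok for s, ok in zip(scores, labels) if s >= lam]
            tp = sum(det)
            precision = tp / len(det) if det else 1.0
            recall = tp / n_placed if n_placed else 0.0
            curve.append((lam, precision, recall))
        return curve
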
Figure 3.10 shows the corresponding precision-recall curves. The first two plots show precision-recall curves using a similar pair of models, where the chairs have similar dimensions, which is expected to result in high false-positive rates (see Figure 3.9, left). Not surprisingly, recognition improves with a lower noise margin and/or a higher sampling density. Performance, however, saturates for Gaussian noise below 0.3 and density above 0.6, since both our model- and part-based components are approximations of the true data, resulting in an inherent discrepancy between the measurement and the model, even in the absence of noise. Note that as long as the parts and dimensions are captured, our system still detects objects even under high noise and sparse sampling.

[Figure 3.10 near here. Three precision (x) vs. recall (y) plots: "Density (a similar pair)" with curves for density 0.4, 0.5, 0.6, 0.7, 0.8; "Noise (a similar pair)" with curves for Gaussian 0.004, 0.008, 0.3, 0.5, 1.0, 2.0; "Data type" with curves for Gaussian 0.004, 0.3, 1.0 on the different pair (dotted) and the similar pair (solid).]

Figure 3.10: Precision-recall curves with varying parameter λ.

Our algorithm is more robust when the pair of models is sufficiently different (see Figure 3.10, right). We tested with two pairs of chairs (see Figure 3.9): the first pair had chairs of similar dimensions as before (in solid lines), while the second pair had a chair and a sofa with large geometric differences (in dotted lines). When tested with the different pair, our system achieved precision higher than 0.98 for recall larger than 0.9. Thus, as long as the geometric space of the objects is sparsely populated, our algorithm has high accuracy in quickly acquiring the geometry of an environment without assistance from data-driven or machine-learning techniques.

|
|
2239
|
+
title | 3.5.2 Real-World Scenes
|
2240
|
+
text | The more practical test of our system is its performance on real scanned data since
|
2241
|
+
| it is difficult to synthetically recreate all the artifacts encountered during scanning
|
2242
|
+
| of a actual physical space. We tested our framework on a range of real-world ex-
|
2243
|
+
| amples, each consisting of multiple objects arranged over large spaces (e.g., office
|
2244
|
+
| areas, seminar rooms, auditoriums) at a university. For both the learning and the
|
2245
|
+
| recognition phases, we acquired the scenes using a Microsoft Kinect scanner with an
|
2246
|
+
meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 55
|
2247
|
+
blank |
|
2248
|
+
|
|
2249
|
+
text | points no. of no. of no. of
|
2250
|
+
| scene model
|
2251
|
+
| per scan scans prim. joints
|
2252
|
+
| chair 28445 7 10 4
|
2253
|
+
| synthetic1 stool 19944 7 3 2
|
2254
|
+
| monitor 60933 7 3 2
|
2255
|
+
| chaira 720364 7 9 5
|
2256
|
+
| synthetic2
|
2257
|
+
| chairb 852072 1 6 0
|
2258
|
+
| synthetic3 chair 253548 4 10 2
|
2259
|
+
| chair 41724 7 8 4
|
2260
|
+
| monitor 20011 5 3 2
|
2261
|
+
| office
|
2262
|
+
| trash bin 28348 2 4 0
|
2263
|
+
| whitebrd. 356231 1 3 0
|
2264
|
+
| auditorium chair 31534 5 4 2
|
2265
|
+
| seminar rm. chair 141301 1 4 0
|
2266
|
+
blank |
|
2267
|
+
text | Table 3.2: Models obtained from the learning phase (see Figure 3.11).
|
2268
|
+
blank |
|
2269
|
+
text | open source scanning library [EEH+ 11]. The scenes were challenging, especially due
|
2270
|
+
| to the amount of variability in the individual model poses (see our project page for
|
2271
|
+
| the input scans and recovered models). Table 3.2 summarizes all the models built
|
2272
|
+
| during the learning stage for these scenes ranging from 3-10 primitives with 0-5 joints
|
2273
|
+
| extracted from only a few scans (see Figure 3.11). While we evaluated our framework
|
2274
|
+
| based on the raw Kinect output rather than on processed data (e.g., [IKH+ 11]), the
|
2275
|
+
| performance limits should be similar when calibrated to the data quality and physical
|
2276
|
+
| size of the objects.
|
2277
|
+
[Figure 3.11 near here.]

Figure 3.11: Various models learned/used in our tests (see Table 3.2).

Our recognition phase was lightweight and fast, taking on average 200 ms to compare a point cluster to a model on a 2.4 GHz CPU with 6 GB RAM. For example, in Figure 3.1, our system detected all 5 chairs present and 4 of the 5 monitors, along with their poses. Note that objects that were not among the learned models remained undetected, including a sofa in the middle of the space and other miscellaneous clutter. We overlaid the unresolved points on the recognized parts for comparison. Note that our algorithm had access to only the geometry of objects, not any color or texture attributes. The complexity of our problem setting can be appreciated by looking at the input scan, which is difficult even for a human to parse visually. We observed Kinect data to exhibit highly non-linear noise effects that were not simulated in our synthetic scans; data also went missing when an object was narrow or specular (e.g., a monitor), with flying pixels along depth discontinuities and severe quantization noise for distant objects.

             number of input points
scene        ave.     min.     max.      objects present   objects detected*
syn. 1       3227     1168     9967      5c 3s 5m          5c 3s 5m
syn. 2       2422     1393     3427      4c_a 4c_b         4c_a 4c_b
syn. 3       1593     948      2704      14 chairs         14 chairs
teaser       6187     2575     12083     5c 5m 0t          5c 4m 0t
office 1     3452     1129     7825      5c 2m 1t 2w       5c 2m 1t 2w
office 2     3437     1355     10278     8c 5m 0t 2w       6c 3m 0t 2w
aud. 1       19033    11377    29260     26 chairs         26 chairs
aud. 2       9381     2832     13317     21 chairs         19 chairs
sem. 1       4326     840      11829     13 chairs         11 chairs
sem. 2       6257     2056     12467     18 chairs         16 chairs
*c: chair, m: monitor, t: trash bin, w: whiteboard, s: stool

Table 3.3: Statistics for the recognition phase. For each scene, we also indicate the corresponding scene in Figure 3.8 and Figure 3.12, when applicable.

Figure 3.12 compiles the results for cluttered office setups, auditoriums, and seminar rooms. Although we tested with different scenes, we present only representative examples, as the performance on all types of scenes was comparable. Our system detected the chairs, computer monitors, whiteboards, and trash bins across different rooms, and the rows of auditorium chairs in different configurations. Our system missed some of the monitors because the material properties of the screens were probably not favorable to Kinect capture. The missed monitors (as in Figure 3.1 and office #2 in Figure 3.12) have big rectangular holes within the screen in the scans. In office #2, the system also missed two of the chairs that were mostly occluded and beyond what our framework can handle.

Even under such demanding data quality, our system can recognize the models and recover poses from data sets an order of magnitude sparser than those required in the learning phase. Surprisingly, the system could also detect the small tables in the two auditorium scenes (1 in auditorium #1, and 3 in auditorium #2) and identify pose changes in the auditorium seats. Figure 3.13 shows a close-up office scene to better illustrate the deformation modes that our system captured. All of the recognized object models have one or more deformation modes, and we can visually compare the quality of the data to the recovered pose and deformation.

The segmentation of real-world scenes is challenging in naturally cluttered setups. The challenge is well demonstrated in the seminar rooms because of closely spaced chairs or chairs leaning against the wall. In contrast to the auditorium scenes, where the rows of chairs are detected together, making the segmentation trivial, in the seminar room setting chairs often occlude each other. The quality of the data also deteriorates because of thin metal legs with specular highlights. Nevertheless, our system correctly recognized most of the chairs along with their correct configurations by first detecting the larger parts. Although only 4-6 chairs were detected in the initial iteration, our system eventually detected most of the chairs in the seminar rooms by refining the segmentation based on the learned geometry (in 3-4 iterations).

|
|
2350
|
+
title | 3.5.3 Comparisons
|
2351
|
+
text | In the learning phase, our system requires multiple scans of an object to build a proxy
|
2352
|
+
| model along with its deformation modes. Unfortunately, the existing public data sets
|
2353
|
+
| do not provide such multiple scans. Instead, we compared our recognition routine
|
2354
|
+
| to the algorithm proposed by Koppula et al. [KAJS11] using author provided code
|
2355
|
+
| to recognize objects from a real-time stream of Kinect data after the user manually
|
2356
|
+
meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 58
|
2357
|
+
blank |
|
2358
|
+
|
|
2359
|
+
|
|
2360
|
+
text | marks the ground plane. We fixed the device location and qualitatively compared
|
2361
|
+
| the recognition results of the two algorithms (see Figure 3.14). We observed that
|
2362
|
+
| Koppula et al. reliably detect floors, table tops and front-facing chairs, but often fail
|
2363
|
+
| to detect chairs facing backwards, or distant ones. They also miss all the monitors,
|
2364
|
+
| which usually are very noisy. In contrast, our algorithm being pose- and variation-
|
2365
|
+
| aware is more stable across multiple frames, even with access to less information (we
|
2366
|
+
| do not use color). Note that while our system detected some monitors, their poses are
|
2367
|
+
| typically biased toward parts where measurements exist. In summary, for partial and
|
2368
|
+
| noisy point-clouds, the probabilistic formulation coupled with geometric reasoning
|
2369
|
+
| results in robust semantic labeling of the objects.
|
2370
|
+
3.5.4 Limitations

While in our tests the recognition results were mostly satisfactory (see Table 3.3), we observed two main failure modes. First, our system failed to detect objects when large amounts of data were missing. In real-world scenarios, our object scans could easily exhibit large holes because of occlusions, specular materials, or thin structures. Further, scans can be sparse and distorted for distant objects. Second, our system cannot overcome the limitations of our initial segmentation. For example, if objects are closer than θ_dist, our system groups them as a single object, while a single object can be confused for multiple objects if its measurements are separated by more than θ_dist from a particular viewpoint. Although in certain cases the algorithm can recover segmentations with the help of other visible parts, this recovery becomes difficult because our system allows objects to deform and hence have variable extent.

However, even with these limitations, our system overall reliably recognized scans with 1000-3000 points per scan, since in the learning phase the system extracted the important degrees of variation, thus providing a compact, yet powerful, model (and deformation) abstraction. In a real office setting, the simplicity and speed of our framework would allow a human operator to immediately notice missed or misclassified objects and quickly re-scan those areas under more favorable conditions. We believe that such progressive scanning will become more commonplace in future acquisition setups.

3.5.5 Applications

Our results suggest that our system is also useful for obtaining a high-level understanding of recognized objects, e.g., the relative position, orientation, and frequency of learned objects. Specifically, as our system progressively scans multiple rooms populated with the same objects, the system gathers valuable co-occurrence statistics (see Table 3.4). For example, from the collected data, the system learns that the orientation of auditorium chairs is consistent (i.e., they face a single direction), or observes a pattern in the relative orientation between a chair and its neighboring monitor. Not surprisingly, our system found chairs to be more frequent in seminar rooms than in offices. In the future, we plan to incorporate such information to handle cluttered datasets while scanning similar environments but with differently shaped objects.

                               distance (m)        angle (°)
scene    relationship          mean     std        mean    std
office   chair-chair           1.207    0.555      78.7    74.4
office   chair-monitor         0.943    0.164      152     39.4
aud.     chair-chair           0.548    0          0       0
sem.     chair-chair           0.859    0.292      34.1    47.4

Table 3.4: Statistics between objects learned for each scene category.

As an exciting possibility, the system can efficiently detect change. By change, we mean the introduction of a new object not seen in the learning phase, while factoring out variations due to different spatial arrangements or changes in individual model poses. For example, in auditorium #2, a previously unobserved chair is successfully detected (highlighted in yellow). Such a mode is particularly useful for surveillance and automated investigation of indoor environments, or for disaster planning in environments that are unsafe for humans to enter.

3.6 Conclusions

We have presented a simple system for recognizing man-made objects in cluttered 3D indoor environments, while factoring out low-dimensional deformations and pose variations, on a scale previously not demonstrated. Our pipeline can be easily extended to more complex environments, primarily requiring reliable acquisition of additional object models and their variability modes.

Several future challenges and opportunities remain: (i) With an increasing number of object prototypes, the system will need more sophisticated search data structures in the recognition phase. We hope to benefit from recent advances in shape search. (ii) We have focused on a severely restricted form of sensor input, namely, poor and sparse geometry alone. We intentionally left out color and texture, which can be quite beneficial, especially if appearance variations can be accounted for. (iii) A natural extension would be to take the recognized models along with their pose and joint attributes to create data-driven, high-quality interior CAD models for visualization, or more schematic representations that may be sufficient for indoor navigation or simply for scene understanding (see Figure 3.1, rightmost image, and recent efforts in scene modeling [NXS12, SXZ+12]).

[Figure 3.12 near here. Panels and labels: office 1 (chair, monitor, desk); office 2 (trash bin, whiteboard); auditorium 1; auditorium 2 (open tables, change detection); seminar room 1 (open seat); seminar room 2 (missed chairs).]

Figure 3.12: Recognition results on various office and auditorium scenes. Since the input scans have limited viewpoints and thus are too poor to provide a clear representation of the scene complexity, we include scene images for visualization (these were unavailable to the algorithm). Note that for the auditorium examples, our system even detected the small tables attached to the chairs; this was possible since the system extracted this variation mode in the learning phase.

[Figure 3.13 near here. Labels: missed monitor; laptop; monitor; chair; drawer deformations.]

Figure 3.13: A close-up office scene. All of the recognized objects have one or more deformation modes. The algorithm inferred the angles of the laptop screen and the chair back, and the heights of the chair seat, the arm rests, and the monitor. Note that our system also captured the deformation modes of open drawers.

[Figure 3.14 near here. Columns: input scene 1, input scene 2; for each, results of [Koppula et al.] vs. ours, with annotations "shifted", "wrong labels", and "missed". Legend: table top, wall, floor, chair base, table leg, monitor, chair back.]

Figure 3.14: We compared our algorithm and Koppula et al. [KAJS11] using multiple frames of scans from the same viewpoint. Our recognition results are more stable across different frames.

Chapter 4

Guided Real-Time Scanning of Indoor Objects³

text | Acquiring 3-D models of the indoor environments is a critical component for under-
|
2545
|
+
| standing and mapping the environments. For successful 3-D acquisition in indoor
|
2546
|
+
| scenes, it is necessary to simultaneously scan the environment, interpret the incom-
|
2547
|
+
| ing data stream, and plan subsequent data acquisition, all in a real-time fashion. The
|
2548
|
+
| challenge is, however, that individual frames from portable commercial 3-D scanners
|
2549
|
+
| (RGB-D cameras) can be of poor quality. Typically, complex scenes can only be
|
2550
|
+
| acquired by accumulating multiple scans. Information integration is done in a post-
|
2551
|
+
| scanning phase, when such scans are registered and merged, leading eventually to
|
2552
|
+
| useful models of the environment. Such a workflow, however, is limited by the fact
|
2553
|
+
| that poorly scanned or missing regions are only identified after the scanning process
|
2554
|
+
| is finished, when it may be costly to revisit the environment being acquired to per-
|
2555
|
+
| form additional scans. In the study presented in this chapter, we focused on real-time
|
2556
|
+
| 3D model quality assessment and data understanding, that could provide immediate
|
2557
|
+
| feedback for guidance in subsequent acquisition.
|
2558
|
+
| Evaluating acquisition quality without having any prior knowledge about an un-
|
2559
|
+
| known environment, however, is an ill-posed problem. We observe that although the
|
2560
|
+
meta | 3
|
2561
|
+
text | The contents of the chapter will be published as Y.M. Kim, N. Mitra, Q. Huang, L. Guibas,
|
2562
|
+
| Guided Real-Time Scanning of Indoor Environments, Pacific Graphics 2013.
|
2563
|
+
blank |
|
2564
|
+
|
|
2565
|
+
|
|
2566
|
+
meta | 64
|
2567
|
+
| CHAPTER 4. GUIDED REAL-TIME SCANNING 65
|
2568
|
+
blank |
|
2569
|
+
|
|
2570
|
+
|
|
2571
|
+
| target scene itself may be unknown, in many cases the scene consists of objects from
| a pre-defined set of object categories. Moreover, these categories are
| well represented in publicly available 3-D shape repositories (e.g., Trimble 3D Ware-
| house). For example, an office setting typically consists of various tables, chairs,
| monitors, etc., all of which have thousands of instances in the Trimble 3D Ware-
| house. In our approach, instead of attempting to reconstruct detailed 3D geometry
| from low-quality inconsistent 3D measurements, we focus on parsing the input scans
| into simpler geometric entities, and use existing 3D model repositories like Trimble
| 3D Warehouse as proxies to assist the process of assessing data quality. Thus, we
| defined two key tasks that an effective acquisition method would need to complete:
| (i) given a partially scanned object, reliably and efficiently retrieve appropriate proxy
| models of it from the database; and (ii) position the retrieved models in the scene
| and provide real-time feedback (e.g., missing geometry that still needs to be scanned)
| to guide subsequent data gathering.
blank |
text | Figure 4.1: We introduce a real-time guided scanning system. As streaming 3D
| data is progressively accumulated (top), the system retrieves the top matching mod-
| els (bottom) along with their pose to act as geometric proxies to assess the current
| scan quality, and provide guidance for subsequent acquisition frames. Only a few
| intermediate frames with corresponding retrieved models are shown in this figure.
blank |
text | We introduce a novel partial shape retrieval approach for finding shapes similar
| to a partial query scan. In our setting, we used the Microsoft Kinect to acquire
| the scans of real objects. The proposed approach, which combines both descriptor-
| based retrieval and registration-based verification, is able to search in a database of
| thousands of models in real-time. To account for partial similarity between the input
| scan and the models in a database, we created simulated scans of each database model
| and compared the scan of the real setting to the simulated scans. This allowed us to
| efficiently compare shapes using global descriptors even in the presence of only partial
| similarity; and the approach remains robust in the case of occlusions or missing data
| about the object being scanned.
| Once our system finds a match, to mark out missing parts in the current merged
| scan, the system aligns it with the retrieved model and highlights the missing parts
| or places where the scan density is low. This visual feedback allows the operator
| to quickly adjust the scanning device for subsequent scans. In effect, our 3D model
| database and matching algorithms make it possible for the operator to assess the
| quality of the data being acquired and discover badly scanned or missing areas while
| the scan is being performed, thus allowing corrective actions to be taken immediately.
| We extensively evaluated the robustness and accuracy of our system using syn-
| thetic data sets with available ground truth. Further, we tested our system on physical
| environments to achieve real-time scene understanding (see the supplementary video,
| which includes the actual scanning session recorded). In summary, in this chapter, we
| present a novel guided scanning interface and introduce a relation-based light-weight
| descriptor for fast and accurate model retrieval and positioning to provide real-time
| guidance for scanning.
blank |
title | 4.1 Related Work
blank |
title | 4.1.1 Interactive Acquisition
text | Fast, accurate, and autonomous model acquisition has long been a primary goal in
| robotics, computer graphics, and computer vision. With the introduction of afford-
| able, portable, commercial RGBD cameras, there has been a pressing need to simplify
| scene acquisition workflows to allow less experienced individuals to acquire scene ge-
| ometries. Recent efforts fall into two broad categories: (i) combining individual
| frames of low-quality point-cloud data with SLAM algorithms [EEH+11, HKH+12] to
| improve scan quality [IKH+11]; and (ii) using supervised learning to train classifiers for
| scene labeling [RBF12] with applications to robotics [KAJS11]. Previously, [RHHL02]
| aggregated scans at interactive rates to provide visual feedback to the user. This work
| was recently expanded by [DHR+11]. [KDS+12] extracted simple planes and recon-
| structed floor plans with guidance from a projector pattern. While our goal is also to
| provide real-time feedback, our system differs from previous efforts in that it uses
| retrieved proxy models to automatically assess the current scan quality, enabling
| guided scanning.
blank |
title | 4.1.2 Scan Completion
text | Various strategies have been proposed to improve noisy scans or plausibly fill in miss-
| ing data due to occlusion: researchers have exploited repetition [PMW+08], symme-
| try [TW05, MPWC12], or used primitives to complete missing parts [SWK07]. Other
| approaches have focused on using geometric proxies and abstractions including curves,
| skeletons, planar abstractions, etc. In the context of image understanding, indoor
| scenes have been abstracted and modeled as a collection of simple cuboids [LGHK10,
| ZCC+12] to capture a variety of man-made objects.
blank |
title | 4.1.3 Part-Based Modeling
text | Simple geometric primitives, however, are not always sufficiently expressive for com-
| plex shapes. Meanwhile, such objects can still be split into simpler parts that aid
| shape understanding. For example, parts can act as entities for discovering rep-
| etitions [TSS10], training classifiers [SFC+11, XS12], or facilitating shape synthe-
| sis [JTRS12]. Alternatively, a database of part-based 3D model templates can be used
| to detect shapes from incomplete data [SXZ+12, NXS12, KMYG12]. Such methods
| often rely on expensive matching, and thus do not lend themselves to low-memory-
| footprint real-time realizations.
blank |
title | 4.1.4 Template-Based Completion
text | Our system also uses a database of 3D models (e.g., chairs, lamps, tables) to retrieve
| shapes from 3D scans. However, by defining a novel simple descriptor, our sys-
| tem, compared to previous efforts, can reliably handle much larger model databases.
| Specifically, instead of geometrically matching templates [HCI+11], or using templates
| to complete missing parts [PMG+05], our system initially searches for consistency in
| the distribution of relations among primitive faces.
blank |
title | 4.1.5 Shape Descriptors
text | In the context of shape retrieval, various descriptors have been investigated for group-
| ing, classification, or retrieval of 3D geometry. For example, the method proposed by
| [CTSO03] uses light-field descriptors based on silhouettes, the method by [OFCD02]
| uses shape distributions to categorize different object classes, etc. The silhouette
| method requires an expensive rotational alignment search, limiting its usefulness in
| our setting to a small number of models (100-200). Both methods assume access
| to nearly complete models to match against. In contrast, for guided scanning, our
| approach can support much larger model sets (about 2000 models) and, more impor-
| tantly, focus on handling poor and incomplete point sets as inputs to the matcher.
blank |
title | 4.2 Overview
text | Figure 4.2 illustrates the pipeline of our guided real-time scanning system, which con-
| sists of a scanning device (Kinect in our case) and a database of 3D shapes containing
| the categories of the shapes present in the environment. In each iteration, the sys-
| tem performs three tasks: (i) scan acquisition from a set of viewpoints specified by a
| user (or a planning algorithm); (ii) shape retrieval using the distribution of relations;
| and (iii) comparison of the scanned pointset with the best retrieved model. The system
| iterates these steps until a sufficiently good match is found (see supplementary video).
| The challenge is how to maintain real-time response.
blank |
text | [Figure 4.2 block diagram: off-line process: database of 3D models -> simulated
| scans -> A2h descriptor -> similarity measure; on-line process: frames of registered
| measurement pointcloud -> segmented, registered pointcloud -> A2h descriptor ->
| retrieved shape -> align shape (retrieved model + pose, density voxel) -> provide
| guidance]
blank |
text | Figure 4.2: Pipeline of the real-time guided scanning framework.
blank |
title | 4.2.1 Scan Acquisition
text | The input stream of a real-time depth sensor (in our case, the Kinect was used) is col-
| lected and processed using an open-source implementation [EEH+11] that calibrates
| the color and depth measurements and outputs the pointcloud data. The color fea-
| tures of individual frames are then extracted and matched across consecutive frames.
| The corresponding depth values are used to incrementally register the depth mea-
| surements [HKH+12]. The pointcloud that belongs to the object is segmented as the
| system detects the ground plane and excludes the points that belong to the plane. We
| will refer to the segmented, registered set of depth measurements as a merged scan.
| Whenever a new frame is processed, the system calculates the descriptor and the
| density voxels from the pointcloud data for the merged scan.
blank |
title | 4.2.2 Shape Retrieval
text | Our goal is to find shapes in the database that are similar to the merged scan. Since
| the merged scan may contain only partial information about the object being scanned,
| our system internally generates simulated views of both the merged scan as well as
| shapes in the database, and then compares the point clouds associated with these
| views. The key observation is that although the merged scan may still have missing
| geometry, it is likely that it contains all the visible geometry of the object being
| scanned when the object is viewed from a particular point of view (i.e., the self-
| occlusions are predictable); it thus becomes comparable to database model views
| from the same or nearby viewpoints. Hence, the system measures shape similarity
| between such point-cloud views. For shape retrieval, our system first performs a
| descriptor-based similarity search against the entire database to obtain a candidate
| set of similar models. Finally, the system performs registration of each model with
| the merged scan and returns the model with the best alignment score.
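
To make that control flow concrete, the loop below is a minimal Python sketch of one guided-scanning pass. It is not the authors' code: all five callables are hypothetical placeholders for the acquisition, descriptor, retrieval, registration, and display components described in Sections 4.2.1-4.3.4, and `missing` is assumed to be a boolean NumPy voxel mask.

    def guided_scan_loop(next_merged_scan, describe, search, register, show_feedback):
        # Control flow of Figure 4.2; every callable is a placeholder.
        while True:
            scan = next_merged_scan()                    # Section 4.2.1: merged scan
            candidates = search(describe(scan))          # Sections 4.3.2-4.3.3
            model, missing = register(scan, candidates)  # Section 4.3.4
            show_feedback(model, missing)                # highlight under-scanned voxels
            if missing.sum() < 0.01 * missing.size:      # stop criterion of Section 4.2.3
                return model                             # proxy explains the scan well
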
text | We note here that past research on global shape descriptors has mostly focused on
| broad differentiation of shape classes, e.g., separating shapes of vehicles from those
| of furniture or of people, etc. In our case, since the system is looking for potentially
| modest amounts of missing geometry in the scans, we aim more for fine variability
| differentiation within a particular object class, such as chairs. We have therefore
| developed and exploited a novel histogram descriptor based on the angles between
| the shape normals for this task (see Section 4.3.2).
blank |
title | 4.2.3 Scan Evaluation
text | Once the retrieved model is computed, the retrieved proxy is displayed for the user.
| The system also highlights voxels with missing data when compared with the best
| matching model, and finishes when the retrieved best-match model is close enough to
| the current measurement (when the missing voxels are fewer than 1% of the total
| number of voxels). In Section 4.3.4, we elaborate on this guided scanning interface.
blank |
title | 4.3 Partial Shape Retrieval
text | Our goal is to quickly assess the quality of the current scan and guide the user in
| subsequent scans. This is challenging on the following counts: (i) the system has
| to assess model quality without necessarily knowing which model is being scanned;
| (ii) the scans are potentially incomplete, with large parts of data missing; and (iii) the
| system should respond in real-time.
| We observe that existing database models such as Trimble 3D Warehouse models
| can be used as proxies for evaluating the scan quality of similar objects being scanned,
| thus addressing the first challenge. Hence, for any merged query scan (i.e., point-
| cloud) S, the system looks for a match among similar models in the database M =
| {M1, ..., MN}. For simplicity, we assume that the up-right orientation of each model
| in the model database is available.
| To handle the second challenge, we note that missing data, even in large chunks,
| are mostly the result of self-occlusion, and hence are predictable. To address this
| problem, our system synthetically scans the models Mi from different viewpoints to
| simulate such self-occlusions. This greatly simplifies the problem by allowing us to
| directly compare S to the simulated scans of Mi, thus automatically accounting for
| missing data in S.
| Finally, to achieve real-time performance, we propose a simple, robust, yet effective
| descriptor to match S to view-dependent scans of Mi. Subsequently, the system
| performs registration to verify the match between each matched simulated scan and
| the query scan, and returns the most similar simulated scan and the corresponding
| model Mi. The following subsections provide further details of each step of
| partial shape retrieval.
blank |
title | 4.3.1 View-Dependent Simulated Scans
text | For each model Mi, the system generates simulated scans S^k(Mi) from multiple cam-
| era positions. Let dup denote the up-right orientation for model Mi. Our system takes
| dup as the z-axis and arbitrarily fixes any orthogonal direction di (i.e., di^T dup = 0) as
| the x-axis. The system also translates the centroid of Mi to the origin.
| The system then virtually positions the cameras at the surface of a view-sphere
| around the origin. Specifically, the camera is placed at
blank |
text | ci := (2d cos θ sin φ, 2d sin θ sin φ, 2d cos φ)
blank |
text | where d denotes the length of the diagonal of the bounding box of Mi, and φ denotes
| the camera altitude. The camera up-vector is defined as
blank |
text | ui := (dup − ⟨dup, c̄i⟩ c̄i) / ‖dup − ⟨dup, c̄i⟩ c̄i‖,   with c̄i = ci / ‖ci‖,
blank |
text | and the gaze point is defined as the origin. The fields of view are set to π/2 in both
| the up and horizontal directions.
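
As a sanity check of the view-sphere construction, the following NumPy sketch enumerates the camera centers ci and up-vectors ui exactly as defined above, taking dup as +z. It is an illustration under those assumptions, not code from the paper.

    import numpy as np

    def view_sphere_cameras(diag, K=6, altitudes=(np.pi / 6, np.pi / 3)):
        # Camera centers at radius 2*diag and up-vectors per the c_i / u_i
        # definitions above; the gaze point is always the origin.
        d_up = np.array([0.0, 0.0, 1.0])
        cameras = []
        for phi in altitudes:
            for k in range(K):
                theta = 2.0 * np.pi * k / K
                c = 2.0 * diag * np.array([np.cos(theta) * np.sin(phi),
                                           np.sin(theta) * np.sin(phi),
                                           np.cos(phi)])
                c_bar = c / np.linalg.norm(c)
                u = d_up - np.dot(d_up, c_bar) * c_bar  # project out view direction
                u /= np.linalg.norm(u)
                cameras.append((c, u))
        return cameras
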
text | For each such camera location, our system obtains a synthetic scan using the z-
| buffer with a grid setting of 200 × 200. Such a grid results in vertices where the grid
| rays intersect the model. The system generates the simulated scan by computing one
| surfel (pf, nf, df) (i.e., a point, normal, and density, respectively) from each quad
| face f = (qf1, qf2, qf3, qf4), as follows:
blank |
text | pf := Σ_{i=1..4} qfi / 4,      nf := Σ_{ijk ∈ {123,234,341,412}} nijk / 4,      (4.1)
| df := 1 / Σ_{ijk ∈ {123,234,341,412}} area(qfi, qfj, qfk)                        (4.2)
blank |
text | where nijk denotes the normal of the triangular face (qfi, qfj, qfk) and nf ← nf / ‖nf‖.
| Thus the simulated scan simply collects the surfels generated from all the quad faces of
| the sampling grid.
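
Equations (4.1)-(4.2) translate directly into code. A minimal sketch, assuming each grid quad is given as a 4×3 array of hit points; skipping degenerate quads is our own simplification.

    import numpy as np

    def quad_surfel(q):
        # One surfel (p_f, n_f, d_f) from a quad q (4x3 array of grid hit
        # points), per Eqs. (4.1)-(4.2); returns None for degenerate quads.
        p = q.mean(axis=0)                            # p_f: quad centroid
        n = np.zeros(3)
        area_sum = 0.0
        for i, j, k in ((0, 1, 2), (1, 2, 3), (2, 3, 0), (3, 0, 1)):
            cross = np.cross(q[j] - q[i], q[k] - q[i])
            norm = np.linalg.norm(cross)
            if norm < 1e-12:
                return None                           # degenerate triangle
            n += cross / (4.0 * norm)                 # average of triangle normals
            area_sum += 0.5 * norm
        return p, n / np.linalg.norm(n), 1.0 / area_sum  # n_f normalized, d_f
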
text | Our system places K samples of θ, i.e., θ = 2kπ/K where k ∈ [0, K), and φ ∈
| {π/6, π/3}, to obtain view-dependent simulated scans for each model Mi. Empirically,
| we set K = 6 to balance between efficiency and quality when comparing simulated
| scans and the merged scan S.
blank |
title | 4.3.2 A2h Scan Descriptor
text | Our goal is to design a descriptor that (i) is efficient to compute, (ii) is robust to
| noise and outliers, and (iii) has a low-memory footprint. We draw inspiration from
| shape distributions [OFCD02], which compute statistics about geometric quantities
| that are invariant to global transforms, e.g., distances between pairs of points on
| the models. Shape distribution descriptors, however, were designed to be resilient to
| local geometric changes. Hence, they are ineffective in our setting, where shapes are
| distinguished by subtle local features. Instead, our system computes the distributions
| of angles between point normals, which better capture the local geometric features.
| Further, since the system knows the upright direction of each shape, this information
| is incorporated into the design of the descriptor.
| Specifically, for each scan S (real or simulated), our system first allocates the
| points into three bins based on their height along the z-axis, i.e., the up-right direction.
| Then, among the points within each bin, the system computes the distribution of
| angles between normals of all pairs of points. The angle space is discretized using 50
| bins between [0, π], i.e., each bin counts the frequency of normal angles falling within
| it. We call this the A2h scan descriptor, which for each point cloud is a 50 × 3 = 150
| dimensional vector; this collects the angle distribution within each height bin.
| In practice, for pointclouds belonging to any merged scan, our system randomly
| samples 10,000 pairs of points within each height bin to speed up the computation. In
| our extensive tests, we found this simple descriptor to perform better than distance-
| only histograms in distinguishing fine variability within a broad shape class (see
| Figure 4.3).
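
A compact NumPy sketch of the A2h computation as just described: three height bins along z, 10,000 random normal pairs per bin, and a 50-bin angle histogram over [0, π]. The uniform pair sampling here is a simplification; the density-aware variant is discussed in Section 4.5.2.

    import numpy as np

    def a2h_descriptor(points, normals, n_pairs=10_000, n_angle_bins=50, rng=None):
        # A2h: 3 height bins x 50 angle bins = 150-D vector.
        # points/normals are (N, 3); normals are assumed unit length.
        rng = np.random.default_rng() if rng is None else rng
        z = points[:, 2]
        edges = np.linspace(z.min(), z.max(), 4)     # three height bins
        hist = np.zeros((3, n_angle_bins))
        for b in range(3):
            idx = np.where((z >= edges[b]) & (z <= edges[b + 1]))[0]
            if len(idx) < 2:
                continue
            i = rng.choice(idx, n_pairs)             # random point pairs
            j = rng.choice(idx, n_pairs)
            cosang = np.clip((normals[i] * normals[j]).sum(axis=1), -1.0, 1.0)
            ang = np.arccos(cosang)                  # normal angles in [0, pi]
            hist[b], _ = np.histogram(ang, bins=n_angle_bins, range=(0.0, np.pi))
            hist[b] /= hist[b].sum()                 # frequencies per bin
        return hist.ravel()
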
blank |
title | 4.3.3 Descriptor-Based Shape Matching
text | A straightforward way to compare two descriptor vectors f1 and f2 is to take the Lp
| norm of their difference vector f1 − f2. However, the Lp norm can be sensitive to
| noise and does not account for the similarity of distribution between similar curves.
| Instead, our system uses the Earth Mover's distance (EMD) to compare a pair of
| distributions [RTG98]. Intuitively, given two distributions, one distribution can be
| seen as a mass of earth properly spread in space, the other distribution as a collection
| of holes that need to be filled with that earth. Then, the EMD measures the least
| amount of work needed to fill the holes with earth. Here, a unit of work corresponds to
| transporting a unit of earth by a unit of ground distance. The costs of "moving earth"
| reflect the notion of nearness between bins; therefore the distortion due to noise is
| minimized. In a 1D setting, EMD with L1 norms is equivalent to calculating an L1
| norm between the cumulative distribution functions (CDFs) of the distributions [Vil03].
| Hence, our system achieves robustness to noise at the same time complexity as
| calculating an L1 norm between the A2h distributions. For all of the results presented
| below, our system used EMD with L1 norms of the CDFs computed from the A2h
| distributions.
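
Because each height-bin histogram is one-dimensional, the EMD reduces to the L1 distance between cumulative sums, as the text notes. A minimal NumPy sketch; combining the three height bins by a plain sum is our assumption, not a detail given in the text.

    import numpy as np

    def emd_1d(h1, h2):
        # 1-D EMD with unit ground distance: L1 norm of the CDF difference [Vil03].
        return np.abs(np.cumsum(h1) - np.cumsum(h2)).sum()

    def a2h_distance(f1, f2, n_height_bins=3):
        # Sum of per-height-bin EMDs between two 150-D A2h descriptors.
        f1 = f1.reshape(n_height_bins, -1)
        f2 = f2.reshape(n_height_bins, -1)
        return sum(emd_1d(a, b) for a, b in zip(f1, f2))
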
text | Because there are 2K view-dependent pointclouds associated with each model Mi,
| the system matches the query S with each such pointcloud S^k(Mi) (k = 1, 2, ..., 2K)
| and records the best matching score. In the end, the system returns the top 25
| matches across the models in M.
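
Putting the pieces together, the descriptor-based search scores each model by its best-matching view and keeps the top 25. A sketch, where `database` maps a model id to its 2K precomputed view descriptors (the dictionary layout is assumed, not prescribed by the text):

    def retrieve(query_desc, database, top=25):
        # Score each model M_i by its best (lowest-distance) view descriptor.
        scores = {mid: min(a2h_distance(query_desc, d) for d in views)
                  for mid, views in database.items()}
        return sorted(scores, key=scores.get)[:top]   # lower distance = better
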
blank |
title | 4.3.4 Scan Registration
text | Our system overlays the retrieved model Mi over the merged scan S as follows: the
| system first aligns the centroid of the simulated scan S^k(Mi) to match the centroid
| of S (note that we do not force the model Mi to touch the ground), while scaling
| model Mi to match the data. To fix the remaining 1-DOF rotational ambiguity, the
| angle space is discretized into 10° intervals, and the system picks the angle for which
| the rotated model best matches the scan S. In practice, we found this refinement step
| necessary since our view-dependent scans have coarse angular resolution (K = 6).
| Finally, the system uses the positioned proxy model Mi to assess the quality of the
| current scan. Specifically, the bounding box of Mi is discretized into 9 × 9 × 9 voxels
| and the density of points that fall within each voxel is calculated. Those
| voxels are highlighted where the matched model has a high density of points (more
| than the average) but where there are insufficient points coming from the scan S,
| thus providing guidance for subsequent acquisitions. The process is terminated when
| there are fewer than 10 such highlighted voxels, and the best matching model is simply
| displayed.
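
A rough NumPy sketch of this step under stated assumptions: the nearest-point cost below is our stand-in for the paper's matching score, and the "more than average" density test follows the description above.

    import numpy as np

    def align_and_flag(scan, model, n_angles=36, grid=9, stop_at=10):
        # Centroid/scale alignment, 1-DOF rotational search in 10-degree
        # steps, then 9x9x9 voxel comparison (Section 4.3.4 sketch).
        scan = scan - scan.mean(axis=0)
        model = model - model.mean(axis=0)
        model = model * (np.linalg.norm(scan, axis=1).max() /
                         np.linalg.norm(model, axis=1).max())

        def rotz(theta):
            c, s = np.cos(theta), np.sin(theta)
            return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

        def cost(theta):  # mean distance from scan points to a model subsample
            d = scan[:, None, :] - (model[::50] @ rotz(theta).T)[None, :, :]
            return np.sqrt((d ** 2).sum(-1)).min(axis=1).mean()

        best = min((2.0 * np.pi * a / n_angles for a in range(n_angles)), key=cost)
        model = model @ rotz(best).T

        lo, hi = model.min(axis=0), model.max(axis=0)  # model bounding box
        def voxel_counts(pts):
            ijk = np.clip(((pts - lo) / (hi - lo) * grid).astype(int), 0, grid - 1)
            counts = np.zeros((grid, grid, grid))
            np.add.at(counts, tuple(ijk.T), 1.0)
            return counts

        cm, cs = voxel_counts(model), voxel_counts(scan)
        missing = (cm > cm[cm > 0].mean()) & (cs == 0)  # dense in model, empty in scan
        return missing, int(missing.sum()) < stop_at    # True -> scanning can stop
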
blank |
title | 4.4 Interface Design
text | The real-time system guides the user to scan an object and retrieve the closest match.
| In our study, we used the Kinect scanner for the acquisition, and the retrieval process
| took 5-10 seconds per iteration in our unoptimized implementation. The user scans an
| object from an operating distance of about 1-3m. The real-time stream of depth
| pointclouds and color images from the sensor is visible to the user at all times (see
| Figure 4.4).
| The user starts scanning by pointing the sensor to the ground plane. The ground
| plane is detected if the sensor captures a dominant plane that covers more than 50% of
| the scene. Our system uses this plane to extract the upright direction of the captured
| scene. When the ground plane is successfully detected, the user receives an indication
| on the screen (Figure 4.4, top-right).
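
The dominant-plane test lends itself to a standard RANSAC sketch; this is our own illustration, and the iteration count and distance threshold are assumptions, not values from the text.

    import numpy as np

    def dominant_plane(points, n_iter=200, tol=0.02, rng=None):
        # RANSAC for the dominant plane; returns (unit normal, offset, inlier
        # fraction). A fraction above 0.5 would count as the detected ground.
        rng = np.random.default_rng() if rng is None else rng
        best = (None, None, 0.0)
        for _ in range(n_iter):
            p = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(p[1] - p[0], p[2] - p[0])
            if np.linalg.norm(n) < 1e-9:
                continue                              # collinear sample
            n /= np.linalg.norm(n)
            d = -np.dot(n, p[0])
            frac = (np.abs(points @ n + d) < tol).mean()
            if frac > best[2]:
                best = (n, d, frac)
        return best
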
text | In a separate window, the pointcloud data corresponding to the object being cap-
| tured is continuously displayed. The system registers the points using image features
| and segments the object by extracting the groundplane. The displayed pointcloud
| data is also used to calculate the descriptor and the voxel density. At the end of
| the retrieval stage (see Section 4.3), the system retains the correspondence between
| the closest matching model and the current pointcloud data. The pointcloud is over-
| laid with two additional cues: (i) missing data in voxels as compared with the closest
| matched model, and (ii) the 3D model of the closest match of the object. Based on
| this guidance, the user can then acquire the next scan. The system automatically
| stops when the matched model is similar to the captured pointcloud.
blank |
title | 4.5 Evaluation
text | We tested the robustness of the proposed A2h descriptor on synthetically generated
| data against available groundtruth. Further, we let novice users use our system
| to scan different indoor environments. The real-time guidance allowed the users to
| effectively capture the indoor scenes (see supplementary video).
blank |
text | dataset   # models   average # points/scan
| chair        2138       45068
| couch        1765      129310
| lamp         1805       11600
| table        5239       61649
blank |
text | Table 4.1: Database and scan statistics.
blank |
title | 4.5.1 Model Database
text | We considered four categories of objects (i.e., chairs, couches, lamps, tables) in our
| implementation. For each category, we downloaded a large number of models from
| the Trimble 3D Warehouse (see Table 4.1) to act as proxy geometry in the online
| scanning phase. The models were pre-scaled and moved to the origin. We syntheti-
| cally scanned each such model from 12 different viewpoints and computed the A2h
| descriptor for each such scan. Note that we placed the camera only above the objects
| (altitudes of π/6 and π/3) as the input scans rarely capture the underside of the ob-
| jects. We used the Kinect scanner to gather streaming data and used an open source
| library [EEH+11] to accumulate the input data to produce merged scans.
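
The offline pass can be summarized as follows. Here `render_view` and `bbox_diagonal` are hypothetical helpers standing in for the z-buffer scanner of Section 4.3.1, while `view_sphere_cameras` and `a2h_descriptor` refer to the sketches given earlier.

    def build_database(models, K=6):
        # Offline: 12 simulated views per model (2 altitudes x K=6 azimuths),
        # one A2h descriptor per view. render_view() is a placeholder.
        db = {}
        for mid, mesh in models.items():
            views = []
            for cam, up in view_sphere_cameras(diag=bbox_diagonal(mesh), K=K):
                pts, nrm = render_view(mesh, cam, up)   # hypothetical renderer
                views.append(a2h_descriptor(pts, nrm))
            db[mid] = views
        return db
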
blank |
title | 4.5.2 Retrieval Results with Simulated Data
text | The proposed A2h descriptor is effective in retrieving similar shapes in fractions of
| a second. Figures 4.5, 4.6, 4.7, and 4.8 show typical retrieval results. In our tests, we
| found the retrieval results to be most useful for chairs and couches, which have a wider
| variation of angles compared to lamps or tables, whose shapes are almost always
| very symmetric.
blank |
title | Effect of Viewpoints
blank |
text | The scanned data often have significant parts missing, mainly due to self-occlusion.
| We simulated this effect on the A2h descriptor-based retrieval and compared the
| performance against retrieval with merged (simulated) scans (Figure 4.9). We found
| the retrieval results to be robust and the models sufficiently representative to be used
| as proxies for subsequent model assessment.
blank |
title | Comparison with Other Descriptors
blank |
text | We also tested existing shape descriptors: the silhouette-based light field descriptor
| [CTSO03], the local spin image [Joh97], and the D2 descriptor [OFCD02]. In all cases,
| we found our A2h descriptor to be more effective in quickly resolving local geometric
| changes, particularly for low-quality partial pointclouds. In contrast, we found the
| light field descriptor to be more susceptible to noise, the local spin image more
| expensive to compute, and the D2 descriptor less able to distinguish between local
| variations than our A2h descriptor (see Figure 4.3).
| We next evaluated the degradation in the retrieval results under perturbations in
| sampling density and noise.
blank |
title | Effect of Density
blank |
text | During scanning, points are sampled uniformly on the sensor grid, instead of uniformly
| on the model surface. This uniform sampling on the sensor grid results in varying
| densities of scanned points depending on the viewpoint. Our system compensates for
| this effect by assigning probabilities that are inversely proportional to the density of
| sample points.
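
In terms of the earlier A2h sketch, the compensation amounts to drawing point pairs with probability inversely proportional to the per-surfel density df. A minimal sketch:

    import numpy as np

    def density_weighted_pairs(idx, densities, n_pairs, rng):
        # Sample point pairs with probability inversely proportional to the
        # per-surfel density d_f, compensating for sensor-grid sampling.
        w = 1.0 / densities[idx]
        w /= w.sum()
        i = rng.choice(idx, n_pairs, p=w)
        j = rng.choice(idx, n_pairs, p=w)
        return i, j
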
text | Figure 4.10 shows the effect of density compensation on the histogram distribu-
| tions. We tested two different combinations of viewpoints and compared the distribu-
| tions, using sampling that is either uniform or inversely proportional to the
| density. Density-aware sampling is indicated by dotted lines. The overall shapes
| of the graphs are similar for uniform and density-aware sampling. However, the
| absolute heights of the peaks agree only when density-aware sampling is used.
| Hence, our system uses density-aware sampling to achieve robustness to
| sampling variations.
blank |
title | Effect of Noise
blank |
text | In Figure 4.11, we show the robustness of A2h histograms under noise. Generally, the
| histograms become smoother under increasing noise as subtle orientation variations
| get masked. For reference, the Kinect measurements from a distance range of 1-2m
| have noise perturbations comparable to 0.005 noise in the simulated data. We
| therefore added synthetic Gaussian noise to the simulated data when calculating the
| A2h descriptors, to better reproduce the shape of the histograms of real scans.
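
One simple way to reproduce this in simulation, under our interpretation of the setup with σ = 0.005 as quoted above; perturbing the normals directly is a shortcut, since a full pipeline would re-estimate normals from the noisy points.

    import numpy as np

    def perturb(points, normals, sigma=0.005, rng=None):
        # Add isotropic Gaussian noise to a simulated scan and re-normalize
        # the normals before recomputing the A2h descriptor.
        rng = np.random.default_rng() if rng is None else rng
        pts = points + rng.normal(0.0, sigma, points.shape)
        nrm = normals + rng.normal(0.0, sigma, normals.shape)
        return pts, nrm / np.linalg.norm(nrm, axis=1, keepdims=True)
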
blank |
title | 4.5.3 Retrieval Results with Real Data
text | Figure 4.12 shows retrieval results on a range of objects (i.e., chairs, couches, lamps,
| and tables). Overall, we found the guided interface to work well in practice. The
| performance was better for chairs and couches, while for lamps and tables, the thin
| structures led to some failure cases. In all cases, the system successfully handled
| missing data as high as 40-60% of the object surface (i.e., with up to half of the object
| surface invisible), and the response of the system was at interactive rates. Note that
| for testing purposes we manually pruned the input database models to leave out
| models (if any) that looked very similar to the target objects to be scanned. Please
| refer to the supplementary video for the system in action.
blank |
title | 4.6 Conclusions
text | We have presented a real-time guided scanning setup for online quality assessment of
| streaming RGBD data obtained while acquiring indoor environments. The proposed
| approach is motivated by three key observations: (i) indoor scenes largely consist of
| a few different types of objects, each of which can be reasonably approximated by
| commonly available 3D model sets; (ii) data is often missing due to self-occlusions,
| and hence such missing regions can be predicted by comparisons against synthetically
| scanned database models from multiple viewpoints; and (iii) streaming scan data can
| be robustly and effectively compared against simulated scans by a direct comparison
| of the distribution of relative local orientations in the two types of scans. The best
| retrieved model is then used as a proxy to evaluate the quality of the current scan and
| guide subsequent acquisition frames. We have demonstrated the real-time system on
| a large number of synthetic and real-world examples with a database of 3D models
| often numbering in the thousands.
| In the future, we would like to extend our guided system to create online recon-
| structions while specifically focusing on generating semantically valid scene models.
| Using context information in the form of co-occurrence cues (e.g., a keyboard and
| mouse are usually near each other) can prove to be effective. Finally, we plan to use
| GPU-optimized code to handle additional categories of 3D models.
blank |
text | [Figure 4.3 row labels, two query examples: D2; A2h; query / aligned model]
blank |
text | Figure 4.3: Representative shape retrieval results using the D2 descriptor ([OFCD02],
| first row), the A2h descriptor introduced in this chapter (Section 4.3.2, second row),
| and the aligned models after scan registration (Section 4.3.4, third row) on the top 25
| matches from A2h. For each method, we only show the top 4 matches. The D2 and
| A2h descriptors (first two rows) are compared by histogram distributions, which is
| quick and efficient. Empirically, we observed the A2h descriptor to better capture
| local geometric features compared to the D2 descriptor, with local registration further
| improving the retrieval quality. The comparison based on 3D alignment (third row)
| is more accurate, but requires more computation time, and cannot be performed in
| real-time given the size of our database of models.
blank |
text | [Figure 4.4 panel labels: scanning setup; detected groundplane; scanning guidance;
| current scan; current scan + retrieved model]
blank |
text | Figure 4.4: The proposed guided real-time scanning setup is simple to use. The
| user starts by scanning using a Microsoft Kinect (top-left). The system first detects
| the ground plane and the user is notified (top-right). The current pointcloud corre-
| sponding to the target object is displayed in the 3D view window, the best matching
| database model is retrieved (overlaid in transparent white), and the predicted missing
| voxels are highlighted as yellow voxels (middle-right). Based on the provided guid-
| ance, the user acquires the next frame of data, and the process continues. Our method
| stops when the retrieved shape explains the captured pointcloud well. Finally, the
| overlaid 3D shape is highlighted in white (bottom-right). Note that the accumulated
| scans have significant parts missing in most scanning steps.
blank |
text | Figure 4.5: Retrieval results with simulated data using a chair data set. Given the
| model in the first column, the database of 2138 models is matched using the A2h
| descriptor, and the top 5 matches are shown.
blank |
text | Figure 4.6: Retrieval results with simulated data using a couch data set. Given the
| model in the first column, the database of 1765 models is matched using the A2h
| descriptor, and the top 5 matches are shown.
blank |
text | Figure 4.7: Retrieval results with simulated data using a lamp data set. Given the
| model in the first column, the database of 1805 models is matched using the A2h
| descriptor, and the top 5 matches are shown.
blank |
text | Figure 4.8: Retrieval results with simulated data using a table data set. Given the
| model in the first column, the database of 5239 models is matched using the A2h
| descriptor, and the top 5 matches are shown.
blank |
text | [Figure 4.9 panel labels, three examples: query object; view-dependent; merged scan]
blank |
text | Figure 4.9: Comparison between retrieval with view-dependent and merged scans.
| The models are sorted by matching scores, with lower scores denoting better matches.
| The leftmost images show the query scans. Note that the view-dependent scan-based
| retrieval is robust even with significant missing regions (∼30-50%). The numbers
| in parentheses denote the view index.
blank |
text | Figure 4.10: Effect of density-aware sampling on two different combinations of views
| (comb1 and comb2). The samplings that consider the density of points are comb1d
| and comb2d, respectively.
blank |
text | Figure 4.11: Effect of noise. The shape of the histogram becomes smoother as the level
| of noise increases.
blank |
text | [Figure 4.12 column labels: image; retrieved proxy model; accumulated scan;
| rows: chairs, couches, lamps, tables]
blank |
text | Figure 4.12: Real-time retrieval results on various datasets. For each set, we show
| the image of the object being scanned, the accumulated pointcloud, and the closest
| retrieved model, along with the top 25 candidates that are picked from the
| database of thousands of models using the proposed A2h descriptor.
meta | Chapter 5
blank |
title | Conclusions
blank |
text | 3-D reconstruction in indoor environments is a challenging problem because of the
| complexity and variety of the objects present, and frequent changes in the positions of
| objects made by the people who inhabit the space. Based on recent technology, the
| work presented in this dissertation frames the reconstruction of indoor environments
| as a set of light-weight systems.
| RGB-D cameras (e.g., Microsoft Kinect) are a new type of sensor, and the standard
| for utilizing the data is not yet fully established. Still, the sensor is revolutionary
| because it is an affordable technology that can capture the 3-D data of everyday
| environments at video frame rate. This dissertation covers quick pipelines that allow
| real-time interaction between the user and the system. However, such data
| comes at the price of complex noise characteristics.
| To reconstruct the challenging indoor structures with limited data, we imposed
| different geometric priors depending on the target applications and aimed for high-
| level understanding. In Chapter 2, we presented a pipeline to acquire floor plans using
| large planes as a geometric prior. We followed the well-known Manhattan-world
| assumption and utilized user feedback to overcome ambiguous situations and specify
| the important planes to be included in the model. Chapter 3 described our use
| of simple models of repeating objects with deformation modes. Public places with
| many repeating objects can be reconstructed by recovering the low-dimensional
| deformation and placement information. Chapter 4 showed how we retrieve the complex
| shapes of objects with the help of a large database of 3-D models, as we developed a
| descriptor that can be computed and searched efficiently and allows online quality
| assessment to be presented to the user.
| Each of the pipelines presented in these chapters targets a specific application
| and has been evaluated accordingly. The work of the dissertation can be extended
| into other possible real-life applications that can connect actual environments with
| the virtual world. The depth data from RGB-D cameras is easy to acquire, but we
| still do not know how to make full use of the massive amount of information produced.
| The potential applications can benefit from better understanding and handling of the
| data. As one extension, we are interested in scaling the database of models and data
| with special attention paid to data structures. The research community and others
| would also benefit from the advances made in the use of reliable depth and color
| features in the new type of data obtained from the RGB-D sensors in addition to the
| presented descriptor.
meta | Bibliography
blank |
ref | [BAD10] Soonmin Bae, Aseem Agarwala, and Fredo Durand. Computational
| rephotography. ACM Trans. Graph., 29(5), 2010.
blank |
ref | [BM92] Paul J. Besl and Neil D. McKay. A method for registration of 3-D
| shapes. IEEE PAMI, 14(2):239–256, 1992.
blank |
ref | [CTSO03] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On
| visual similarity based 3D model retrieval. CGF, 22(3):223–232, 2003.
blank |
ref | [CY99] James M. Coughlan and A. L. Yuille. Manhattan world: Compass
| direction from a single image by Bayesian inference. In ICCV, pages
| 941–947, 1999.
blank |
ref | [CZ11] Will Chang and Matthias Zwicker. Global registration of dynamic range
| scans for articulated model reconstruction. ACM TOG, 30(3):26:1–
| 26:15, 2011.
blank |
ref | [Dey07] T. K. Dey. Curve and Surface Reconstruction: Algorithms with Math-
| ematical Analysis. Cambridge University Press, 2007.
blank |
ref | [DHR+11] Hao Du, Peter Henry, Xiaofeng Ren, Marvin Cheng, Dan B. Goldman,
| Steven M. Seitz, and Dieter Fox. Interactive 3D modeling of indoor
| environments with a consumer depth camera. In Proc. Ubiquitous Com-
| puting, pages 75–84, 2011.
blank |
ref | [EEH+11] Nikolas Engelhard, Felix Endres, Jürgen Hess, Jürgen Sturm, and Wol-
| fram Burgard. Real-time 3D visual SLAM with a hand-held RGB-D
| camera. In Proc. of the RGB-D Workshop on 3D Perception in Robotics
| at the European Robotics Forum, 2011.
blank |
ref | [FB81] Martin A. Fischler and Robert C. Bolles. Random sample consensus:
| a paradigm for model fitting with applications to image analysis and
| automated cartography. Commun. ACM, 24(6):381–395, June 1981.
blank |
ref | [FCSS09] Y. Furukawa, B. Curless, S.M. Seitz, and R. Szeliski. Reconstructing
| building interiors from images. In ICCV, pages 80–87, 2009.
blank |
ref | [FSH11] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing
| structural relationships in scenes using graph kernels. ACM TOG,
| 30(4):34:1–34:11, 2011.
blank |
ref | [GCCMC08] Andrew P. Gee, Denis Chekhlov, Andrew Calway, and Walterio Mayol-
| Cuevas. Discovering higher level structure in visual SLAM. IEEE Trans-
| actions on Robotics, 24(5):980–990, October 2008.
blank |
ref | [GEH10] Abhinav Gupta, Alexei A. Efros, and Martial Hebert. Blocks world re-
| visited: Image understanding using qualitative geometry and mechan-
| ics. In ECCV, pages 482–496, 2010.
blank |
ref | [HCI+11] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab,
| and V. Lepetit. Multimodal templates for real-time detection of texture-
| less objects in heavily cluttered scenes. In ICCV, 2011.
blank |
ref | [HKG11] Qixing Huang, Vladlen Koltun, and Leonidas Guibas. Joint-shape seg-
| mentation with linear programming. ACM TOG (SIGGRAPH Asia),
| 30(6):125:1–125:11, 2011.
blank |
ref | [HKH+12] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter
| Fox. RGBD mapping: Using Kinect-style depth cameras for dense 3D
| modeling of indoor environments. I. J. Robotic Res., 31(5):647–663,
| 2012.
blank |
ref | [IKH+11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard
| Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Free-
| man, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: real-time
| 3D reconstruction and interaction using a moving depth camera. In
| Proc. UIST, pages 559–568, 2011.
blank |
ref | [Joh97] Andrew Johnson. Spin-Images: A Representation for 3-D Surface
| Matching. PhD thesis, Robotics Institute, CMU, 1997.
blank |
ref | [JTRS12] Arjun Jain, Thorsten Thormahlen, Tobias Ritschel, and Hans-Peter Sei-
| del. Exploring shape variations by 3D-model decomposition and part-
| based recombination. CGF (EUROGRAPHICS), 31(2):631–640, 2012.
blank |
ref | [KAJS11] H.S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic la-
| beling of 3D point clouds for indoor scenes. In NIPS, pages 244–252,
| 2011.
blank |
ref | [KDS+12] Young Min Kim, Jennifer Dolson, Michael Sokolsky, Vladlen Koltun,
| and Sebastian Thrun. Interactive acquisition of residential floor plans.
| In ICRA, pages 3055–3062, 2012.
blank |
ref | [KMYG12] Young Min Kim, Niloy J. Mitra, Dong-Ming Yan, and Leonidas Guibas.
| Acquiring 3D indoor environments with variability and repetition. ACM
| TOG, 31(6), 2012.
blank |
ref | [LAGP09] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust
| single-view geometry and motion reconstruction. ACM TOG (SIG-
| GRAPH), 28(5):175:1–175:10, 2009.
blank |
ref | [LGHK10] David Changsoo Lee, Abhinav Gupta, Martial Hebert, and Takeo
| Kanade. Estimating spatial layout of rooms using volumetric reasoning
| about objects and surfaces. In NIPS, pages 1288–1296, 2010.
blank |
ref | [LH05] Marius Leordeanu and Martial Hebert. A spectral technique for cor-
| respondence problems using pairwise constraints. In ICCV, volume 2,
| pages 1482–1489, 2005.
blank |
ref | [MFO+07] Niloy J. Mitra, Simon Flory, Maks Ovsjanikov, Natasha Gelfand,
| Leonidas Guibas, and Helmut Pottmann. Dynamic geometry registra-
| tion. In Symp. on Geometry Proc., pages 173–182, 2007.
blank |
ref | [Mic10] Microsoft. Kinect for Xbox 360. http://www.xbox.com/en-US/kinect,
| November 2010.
blank |
ref | [MM09] Pranav Mistry and Pattie Maes. SixthSense: a wearable gestural in-
| terface. In SIGGRAPH ASIA Art Gallery & Emerging Technologies,
| page 85, 2009.
blank |
ref | [MPWC12] Niloy J. Mitra, Mark Pauly, Michael Wand, and Duygu Ceylan. Symme-
| try in 3D geometry: Extraction and applications. In EUROGRAPHICS
| State-of-the-art Report, 2012.
blank |
ref | [MYY+10] N. Mitra, Y.-L. Yang, D.-M. Yan, W. Li, and M. Agrawala. Illus-
| trating how mechanical assemblies work. ACM TOG (SIGGRAPH),
| 29(4):58:1–58:12, 2010.
blank |
ref | [MZL+09] Ravish Mehra, Qingnan Zhou, Jeremy Long, Alla Sheffer, Amy Gooch,
| and Niloy J. Mitra. Abstraction of man-made shapes. ACM TOG
| (SIGGRAPH Asia), 28(5):#137, 1–10, 2009.
blank |
ref | [ND10] Richard A. Newcombe and Andrew J. Davison. Live dense reconstruc-
| tion with a single moving camera. In CVPR, 2010.
blank |
ref | [NXS12] Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach
| for cluttered indoor scene understanding. ACM TOG (SIGGRAPH
| Asia), 31(6), 2012.
blank |
ref | [OFCD02] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David
| Dobkin. Shape distributions. ACM Transactions on Graphics,
| 21(4):807–832, October 2002.
blank |
ref | [OLGM11] Maks Ovsjanikov, Wilmot Li, Leonidas Guibas, and Niloy J. Mitra.
| Exploration of continuous variability in collections of 3D shapes. ACM
| TOG (SIGGRAPH), 30(4):33:1–33:10, 2011.
blank |
ref | [PMG+05] Mark Pauly, Niloy J. Mitra, Joachim Giesen, Markus Gross, and
| Leonidas J. Guibas. Example-based 3D scan completion. In Symp.
| on Geometry Proc., pages 23–32, 2005.
blank |
ref | [PMW+08] M. Pauly, N. J. Mitra, J. Wallner, H. Pottmann, and L. Guibas. Discov-
| ering structural regularity in 3D geometry. ACM TOG (SIGGRAPH),
| 27(3):43:1–43:11, 2008.
blank |
ref | [RBF12] Xiaofeng Ren, Liefeng Bo, and D. Fox. RGB-D scene labeling: Features
| and algorithms. In CVPR, pages 2759–2766, 2012.
blank |
ref | [RHHL02] Szymon Rusinkiewicz, Olaf Hall-Holt, and Marc Levoy. Real-time 3D
| model acquisition. ACM TOG (SIGGRAPH), 21(3):438–446, 2002.
blank |
ref | [RL01] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP
| algorithm. In Proc. 3DIM, 2001.
blank |
ref | [RTG98] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for
| distributions with applications to image databases. In ICCV, pages
| 59–, 1998.
blank |
ref | [SFC+11] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark
| Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-
| time human pose recognition in parts from a single depth image. In
| CVPR, pages 1297–1304, 2011.
blank |
ref | [SvKK+11] Oana Sidi, Oliver van Kaick, Yanir Kleiman, Hao Zhang, and Daniel
| Cohen-Or. Unsupervised co-segmentation of a set of shapes via
| descriptor-space spectral clustering. ACM TOG (SIGGRAPH Asia),
| 30(6):126:1–126:10, 2011.
blank |
ref | [SWK07] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient RANSAC
| for point-cloud shape detection. CGF (EUROGRAPHICS), 26(2):214–
| 226, 2007.
blank |
ref | [SWWK08] Ruwen Schnabel, Raoul Wessel, Roland Wahl, and Reinhard Klein.
| Shape recognition in 3D point-clouds. In Proc. WSCG, pages 65–72,
| 2008.
blank |
ref | [SXZ+12] Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and
| Baining Guo. An interactive approach to semantic modeling of indoor
| scenes with an RGBD camera. ACM TOG (SIGGRAPH Asia), 31(6),
| 2012.
blank |
ref | [Thr02] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel,
| editors, Exploring Artificial Intelligence in the New Millenium. Morgan
| Kaufmann, 2002.
blank |
ref | [TMHF00] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W.
| Fitzgibbon. Bundle adjustment - a modern synthesis. In Proceedings of
| the International Workshop on Vision Algorithms: Theory and Practice,
| ICCV '99. Springer-Verlag, 2000.
blank |
ref | [TSS10] R. Triebel, J. Shin, and R. Siegwart. Segmentation and unsupervised
| part-based discovery of repetitive objects. In Proceedings of Robotics:
| Science and Systems, 2010.
blank |
ref | [TW05] Sebastian Thrun and Ben Wegbreit. Shape from symmetry. In ICCV,
| pages 1824–1831, 2005.
blank |
ref | [VAB10] Carlos A. Vanegas, Daniel G. Aliaga, and Bedrich Benes. Building
| reconstruction using Manhattan-world grammars. In CVPR, pages 358–
| 365, 2010.
blank |
ref | [Vil03] C. Villani. Topics in Optimal Transportation. Graduate Studies in
| Mathematics. American Mathematical Society, 2003.
blank |
ref | [XLZ+10] Kai Xu, Honghua Li, Hao Zhang, Daniel Cohen-Or, Yueshan Xiong,
| and Zhiquan Cheng. Style-content separation by anisotropic part scales.
| ACM TOG (SIGGRAPH Asia), 29(5):184:1–184:10, 2010.
blank |
ref | [XS12] Yu Xiang and Silvio Savarese. Estimating the aspect layout of object
| categories. In CVPR, pages 3410–3417, 2012.
blank |
ref | [XZZ+11] Kai Xu, Hanlin Zheng, Hao Zhang, Daniel Cohen-Or, Ligang Liu, and
| Yueshan Xiong. Photo-inspired model-driven 3D object modeling. ACM
| TOG (SIGGRAPH), 30(4):80:1–80:10, 2011.
blank |
ref | [ZCC+12] Youyi Zheng, Xiang Chen, Ming-Ming Cheng, Kun Zhou, Shi-Min Hu,
| and Niloy J. Mitra. Interactive images: Cuboid proxies for smart image
| manipulation. ACM TOG (SIGGRAPH), 31(4):99:1–99:11, 2012.
blank |